Preface


This is an introduction to probability theory, designed for self-study. It covers 
the same topics as the one-semester introductory courses which I taught at 
the University of Minnesota, with some extra discussion for reading on your 
own. 

Probability theory is certainly useful. But how does it feel to study it? 
Well, like other areas of mathematics, probability theory contains elegant 
concepts, and it gives you a chance to exercise your ingenuity, which is often 
fun. But in addition, randomness and probability are part of our experience 
in the real world, present everywhere and yet still somewhat mysterious. This 
gives the subject of probability a special interest. 

With self-study in mind, detailed solutions are given for all the exercises 
here. The exercises are mixed in with the exposition, and you are encouraged 
to solve them (on paper) as you read the theory. To get the benefit of 
an exercise, please work it out, or attempt it seriously, before reading the 
solution. Tackling at least some of the exercises is essential for learning. 

Many facts are stated as numbered lemmas or remarks, often with de- 
scriptive names. This adds some noise, but should help in following the train 
of thought on your own. If a proof is given, the purpose is to clarify con- 
cepts, and all details are explained. Proofs are always optional in this book, 
but readers are encouraged to work at them, since proofs are one of the 
ways in which we internalize mathematical ideas. Internalizing ideas means 
making them part of our thinking, rather than leaving them as recipes from 
an outside source. Solving problems, working through examples, and think- 
ing about the physical meaning of concepts are other ways of internalizing 
mathematics. 

When reading this book on a computer (which is the intended way) you 
can use links to hop back and forth between exercises and solutions, as well 
as to follow references to equations and theorems. There is a large table of 
contents and a large index with links to topics and definitions. You can use
the index to quickly locate things that seem unfamiliar. (Most pdf viewers 
can return from following a link, coming back to the previous spot. This saves 
time. There may be a button to return, or a keystroke like “ctrl-left-arrow” 
or “alt-left-arrow”.)

The order of the chapters is fairly logical, but a different order might be 
just as natural. Depending on your interests, some chapters can be omitted, 
or read quickly. Skipping around and skimming are always allowed. 

Learning mathematics always requires some “intense solitary thought”, 
but it is also a human activity. If you have an opportunity to discuss your 
work and share ideas with others, try to do that. There are many good 
textbooks on probability theory, and dipping into another book can be very 
stimulating, especially if you find a different approach to a topic. 

It is just possible that there are a few misprints. Corrections and sugges- 
tions will be gratefully received at probabilitybook@gmail.com. I particularly 
wish to thank Larry Susanka, who contributed many insightful comments on 
probability. 

This book is dedicated to all the participants in my probability classes. 
Thanks for listening! 


John Baxter 


September 2023 


Chapter 1


Probability and Events 


In this chapter we try to explain the real-world background for probability. 
The discussion does not make much use of mathematics, and can be read 
quite rapidly. After working through later chapters, readers may find it 
worthwhile to look over this introduction again, and compare it with the 
precise statements of mathematical probability. 


1.1 Common sense probability 


Events in the real world are often unpredictable, and happen without clear 
causes. Such events are said to be random. To deal with randomness we 
all use “common sense probability”, and we do so with little or no use of 
mathematics. For example, no one needs to study probability theory to 
decide whether it is safe to cross the road. But the concepts in probability 
are of interest, and mathematical probability theory is used widely in science 
and industry. In this book we are studying mathematical probability theory. 
We will build upon our understanding of common sense probability. 

Can we give a simple definition of the concept of probability? A simple 
definition of a concept would be one that is expressed in terms of other 
concepts. But some concepts seem to be so basic that they cannot be given 
this sort of definition. For example: we all have some understanding of the 
geometrical concept of three-dimensional space, but if someone asked you to 
give a simple definition of space, what would you say? We seem to build 
up our intellectual understanding of space gradually, not through a simple 
definition, but through the use of this concept as we experience the world 
around us.

Is probability like that, or not? Certainly probability is a very different 
property from space or time. But let’s try an experiment. 


Figure 1.1: Box 1 and Box 2


Example 1.1 (The two boxes). Imagine that someone presents you with 
a choice of two boxes, Box 1 and Box 2. You cannot see inside either box, 
but you are allowed to choose one of the two boxes, and then reach into that 
box and take out one object. You must select an object in the box without 
looking, so you have no control over which object you obtain from the box. 

You know that Box 1 contains six objects. One is a valuable diamond, 
and the other five objects are merely stones from the road. And you also 
know that Box 2 contains six objects. Five of these objects are valuable 
diamonds, and the remaining one is a stone without value. See Figure 1.1.

Remember, you must choose either Box 1 or Box 2 before you make your 
selection from the box. After you make your selection, you will be holding 
one object in your hand, either a valuable diamond or a worthless stone. 
Assuming that you wish to get rich, which box should you choose? 

The unanimous answer is surely “Box 2”. That is an example of common 
sense probability. But now comes the challenge: explain why you would 
choose Box 2, without using the word “probability”, and without using any 
synonym, such as “chance” or “odds” or “likelihood”.


An answer to this challenge might help in formulating a definition of 
probability. However, in Example 1.1 we didn’t state exactly what constitutes
an explanation, so someone might respond by simply saying “Box 2 gives 
you more ways to win”. Such an answer certainly identifies a difference, but 
doesn’t explain why this difference matters. So one could debate whether 
this is a sufficient explanation. But it doesn’t give us a definition. 

At any rate, as far as your author knows, no one has ever given a simple 
definition of probability. And that’s ok! In this book we will build up our 
understanding of probability through examples and mathematical properties, 
drawing on our experience with probability in the real world. 


Exercise 1.1. Consider a more complicated version of Example 1.1. Keep
Box 2 the same, but change Box 1 to have 10 diamonds and 90 stones. In 
that case Box 1 certainly gives you “more ways to win”. Is Box 2 still a 
better choice? 


1.2 Probability as belief? 


One can regard probability as a way of measuring what might be called 
“degree of belief”. To a possible future event, we assign a number between 0 
and 1, called the probability, which expresses our confidence that the event 
will happen. 

Probability 1 for any event means we are certain the event will happen, 
probability 0 means we think it is impossible. Probability values which are 
between 0 and 1 mean we are not sure. 

Our common sense probability judgments are based on knowledge. Your 
knowledge might be different from mine, and as a result we might assign 
very different likelihoods to the same possible event. So it is natural to try to
describe probability as a belief inside your head, i.e. something subjective. 
Is this a sensible definition? 

Defining probability as “degree of belief” turns out to be an elegant way 
to think about the formulas of probability theory. And it is not wrong, just 
insufficient. We must still try to connect those probabilities inside our heads 
with the external world, and explain the brutal fact that correct assessments 
of probability tend to keep you alive, and incorrect assessments of probability
tend to kill you. A practical connection between probability and the real
world will be stated as Probability Fact 1.1, after some discussion of concepts.


1.3 Experiments


We will use the phrase “experimental situation” as a convenient general term to
describe a situation in which you know the setting but may have incomplete 
information. For brevity we might also just say “experiment” to describe 
this situation. 

For example, perhaps someone will take an action, or has taken an action, 
and the result of this action is unknown to you, although you may learn the 
result later. We are calling the situation and the action an experimental 
situation, even though it need not arise from something you do in a scientific 
laboratory. It might just be tossing a coin, and indeed a coin toss is one of 
our standard examples. 

The result of the experiment will often be called the outcome. 

A real experiment takes place at a definite place and time, is carried out 
by particular people, and so on. Most of those details are irrelevant when 
calculating a probability. 

When we talk about the outcome of the experiment, we usually only mean 
the features which are essential for our purposes. So for a coin toss we tersely 
say that the outcome is either a head or a tail. 


1.4 Repeated coin tosses 


Think more about tossing a coin. 

We are not surprised that the result of tossing a coin is unpredictable. It 
seems that small changes, even ones that are too small to notice, can have 
an effect on the result of the toss. The coin is usually spinning in the air, 
and if it spins just a little faster, or we toss it just a little higher, that can 
change the result. Even if we try to toss the coin the same way each time, 
for most people there seems to be some kind of “shakiness” in the motions 
of their arms and hands. Perhaps that leads to unpredictability. 

Suppose someone asserts, in everyday language, that a particular coin 
has probability .55 of coming up heads when tossed. This number .55 does 
not help very much in predicting what will happen the next time we toss the
coin! What is such a probability value good for? 

It is perhaps surprising that probability does tell us something useful 
in this situation, provided that we are willing to toss the coin many times. 
Given that the probability of a head is .55, we expect that if we toss the 
coin 10000 times, it is likely that approximately 5500 of the results will be 
heads, although it is unlikely that exactly 5500 heads will be obtained. Please 
note that there are two vague words in the previous sentence: “likely” and 
“approximately”. And yet, despite the vagueness, this is a key insight about 
the world. 

The concept of the frequency gives us a convenient way to express what 
probability tells us. Here’s the definition of frequency. We’ll state it for the 
coin-tossing situation, but it applies to any experimental situation. 


Definition 1.2 (Frequencies). When the coin is tossed N times, and heads 
occur on M of the tosses, we say that the frequency f of heads is given by 


f = M/N. (1.1)

Thus the frequency of heads is the fraction of times that a head is obtained. 

Our interpretation of the probability .55 is: if we toss the coin many 
times, we are confident that the frequency of heads will be approximately 
.55. This is an example of the “Frequency Interpretation of Probability”. 
The general statement is given below in Probability Fact 1.1.
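
As a small illustration of this interpretation, here is a Python sketch that simulates tossing a coin whose probability of heads is assumed to be .55. The function name `heads_frequency` and the seeded pseudo-random generator are choices made for this sketch only, and a simulated coin is of course just a stand-in for a real one.

```python
import random


def heads_frequency(p_heads, n_tosses, seed=None):
    """Simulate n_tosses coin tosses and return the frequency f = M/N of heads."""
    rng = random.Random(seed)  # seeded generator, so a run can be reproduced
    # M counts the tosses that come up heads; each simulated toss is a head
    # with probability p_heads (an assumption of this sketch).
    m = sum(1 for _ in range(n_tosses) if rng.random() < p_heads)
    return m / n_tosses


if __name__ == "__main__":
    for n in (10, 100, 10_000, 100_000):
        print(n, heads_frequency(0.55, n, seed=n))
```

With a small number of tosses the printed frequency may be far from .55; with many tosses one would expect it to be close to .55, though rarely exactly equal.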

Readers will be familiar with this way of thinking about probability. We 
expect that a baseball player with a high batting average is more likely to 
get a hit than someone with a low average, and so on. 

But perhaps we should try to be surprised, just for a moment! Suppose 
we toss a coin 10,000 times, and get 5439 heads. If we toss the coin another 
10,000 times, we certainly don’t know what will happen on any particular 
toss. And yet, even if no one told us the probability of a head with this coin, 
we feel confident that the total number of heads the next time will not be 
too different from what was obtained the first time! So in this limited sense 
we can predict the future, and that is still enormously helpful. 

Try to imagine a world in which the frequency in one series of tosses 
told us nothing about the frequency in the next series of tosses. That world 
would be far more chaotic than the one we live in. Planning and decision- 
making might be so difficult that we could not survive. And the concept of 
probability would not exist. 


1.5 Selecting from the box


Return to the experiment described in Example 1.1. Imagine that you are
able to repeat this experiment many times. 

Suppose that on each repetition you choose Box 2, and then remove one 
object from Box 2, which must be either a diamond or a stone. What do you 
think will happen? 

Each repetition of this experiment is supposed to be a fresh start, with 
no connection to the results of the previous repetitions. Box 2 contains five 
diamonds and one stone, each time. We can picture the box as being shaken 
vigorously each time before the object is selected, so we have no idea of the 
positions of objects inside. And we should assume that the diamonds and 
the stone are indistinguishable by touch, so we have no control at all over 
which object is selected. 

In a long series of repetitions of this experiment, very likely the following 
will be true. You will obtain a diamond from the box in approximately 5/6 
of the repetitions, and in approximately 1/6 of the repetitions you will get 
the stone. 

Of course, if Box 1 were chosen for each repetition of the experiment, 
we would expect that approximately 1/6 of the time a diamond would be 
obtained. Since 1/6 < 5/6, Box 2 is a better choice than Box 1 because it 
gives a larger frequency of success. 
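
The repeated experiment can also be imitated on a computer. The following Python sketch is only an illustration: the names `BOX_1`, `BOX_2` and `diamond_frequency` are invented here, and the sketch assumes that each object in a box is equally likely to be picked on every repetition.

```python
import random

# Contents of the boxes in Example 1.1: "D" is a diamond, "S" is a stone.
BOX_1 = ["D"] + ["S"] * 5
BOX_2 = ["D"] * 5 + ["S"]


def diamond_frequency(box, n_repetitions, seed=None):
    """Return the fraction of repetitions in which a diamond is drawn."""
    rng = random.Random(seed)
    # Each repetition is a fresh start: one object is picked blindly from
    # the full box, with no object favored (an assumption of this sketch).
    diamonds = sum(1 for _ in range(n_repetitions) if rng.choice(box) == "D")
    return diamonds / n_repetitions


if __name__ == "__main__":
    print("Box 1:", diamond_frequency(BOX_1, 60_000, seed=1))
    print("Box 2:", diamond_frequency(BOX_2, 60_000, seed=1))
```

In a long run one would expect the two printed frequencies to be near 1/6 and 5/6 respectively.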

To express our thoughts more concisely, we can use probability language 
instead of frequency language. 

Thus we would say that when selecting an object from Box 2, the probabil- 
ity of success is 5/6, and when selecting an object from Box 1, the probability 
of success is only 1/6. This is how most people would justify choosing Box 2 
before selecting an object. 

But we must admit that our analysis deals with many repetitions. If you 
only perform the experiment once with Box 2, it can certainly happen that 
you will obtain the worthless stone. In practical experience, we all know that 
a good strategy can sometimes be defeated by luck. 


1.6 The frequency interpretation 


We will be talking about an interpretation for the probability of an event. 
The word “event” is used in ordinary speech, but let’s define a slightly more 
precise usage here.


Definition 1.3 (Events). We will use the term “event” to describe the 
occurrence or non-occurrence of a property of the outcome of an experiment. 
We often denote such an event by a letter, so for example we might speak of 
the event A. 


Remember that the concept of probability has not been given a precise 
definition, although we’ve talked about common sense probability, and we’ve 
talked about probability as a degree of belief. Rather than seeking a final 
definition of probability, we will consider what properties must govern the 
use of this concept. 

In a particular situation, one may estimate the probability of certain 
events by means of careful observation, or from general opinion, or perhaps 
from a “feeling” based on experience. Once those probability values have 
been assigned, the general laws of probability then determine the probabilities 
of many other events. 

For any event A, let us write P(A) to denote the probability that we
assign to A.

If the event is defined for a particular experiment, imagine carrying out 
the experiment repeatedly, for a total of N repetitions. Sometimes each 
experiment in the sequence of repetitions is called a “trial”. The repeated 
experiments are distinct actions, but are supposed to take place in similar 
settings. 

What does it mean to say that settings are “similar”? Settings which look 
similar may have subtle differences that influence the outcomes which we 
observe. This means that we must think hard when applying probability to 
real-world settings, and use our practical experience as well as mathematical 
theory. But we won’t worry about that right now. 

Any physically meaningful probability value must be consistent with the 
following. 


Probability Fact 1.1 (The frequency interpretation of probability). 
For an event A, the observed frequency of occurrence of A, in any sufficiently 
long sequence of repetitions of similar experimental situations, will likely be 
close to P(A). 

In applications, we can use the frequency interpretation to find a probability
that we don’t know, and to predict a frequency from a probability that
we do know. 

If the frequency of an event in repeated experiments does not match the 
probability that we have assigned to the event, that indicates an error. 


Remark 1.4 (Do we have a definition of probability here?). The 
answer to this question depends on your standards for definitions. However, 
it must be noted that the frequency interpretation of probability cannot 
provide a precise definition of probability. Since the word “likely” is used, 
a definition based on the frequency interpretation would be circular, since 
you would have to already know at least something about the meaning of 
probability, in order to understand its definition! Furthermore, the statement 
is vague. Look at those weasel-words “sufficiently long” and “close”, in the 
statement. If you want the observed frequency to be, say, within 1% of the 
probability, how long is “sufficiently long”? 
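
One way to get a feeling for this question is to experiment. The Python sketch below is an illustration under stated assumptions: it uses a simulated fair coin, it reads “within 1%” as “within .01 of the probability”, and the function name `fraction_of_runs_close` is invented here. It repeats a series of N tosses many times and reports how often the observed frequency lands within the tolerance.

```python
import random


def fraction_of_runs_close(p, n_tosses, tolerance, n_runs, seed=None):
    """Estimate how often the frequency of heads lands within `tolerance` of p.

    Each run simulates n_tosses tosses of a coin with P(head) = p and
    computes the frequency f = M/N. The function returns the fraction of
    the n_runs runs for which |f - p| <= tolerance.
    """
    rng = random.Random(seed)
    close_runs = 0
    for _ in range(n_runs):
        heads = sum(1 for _ in range(n_tosses) if rng.random() < p)
        if abs(heads / n_tosses - p) <= tolerance:
            close_runs += 1
    return close_runs / n_runs


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, fraction_of_runs_close(0.5, n, 0.01, 200, seed=0))
```

One would expect the reported fraction to grow toward 1 as N increases, which gives a rough practical meaning to “sufficiently long” for this particular coin and tolerance.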


And yet, despite its theoretical deficiencies, the frequency interpretation 
is the most important practical statement we can make about the connec- 
tion between mathematical probability statements and physical probability 
statements. Whatever assumptions we make later about mathematical prob- 
abilities must be consistent with the frequency interpretation. 


Example 1.5 (Events that are certain and events that are impos- 
sible). For some experiment, let A be an event which is certain to occur, 
and let C be an event which is impossible. Then we say that P(A) = 1, and
P(C) = 0. Are these definitions forced on us by the frequency interpretation? 

The frequency interpretation says that if we repeat the experiment many 
times, the measured frequency of A is likely to be close to P(A). 

Consider N repetitions of the experiment. Since A is certain, it will occur 
N times, giving an experimental frequency of N/N = 1. The frequency 
interpretation says that this experimental frequency is likely to be close to 
P(A) when N is large. If P(A) were different from 1, say P(A) = .8, it 
wouldn’t be likely at all that the experimental frequency was close to P(A). 
So yes, the value of P(A) in this case must be equal to 1. 

In the same way, the frequency interpretation requires that P(C) = 0. 


1.7 Streamlining probability arguments


Although discussions of frequencies can be enlightening, it is usually faster 
to work with probabilities. The following is a key property of probability. 


Probability Fact 1.2 (Probabilities are additive over cases). Let D_1, ..., D_k
be events for some experiment.

Suppose that the events D_1, ..., D_k are mutually exclusive, meaning that
at most one of the events D_i can occur. Let A be the event that one of
D_1, ..., D_k occurs. This means that if any of the events D_1, ..., D_k occurs,
by definition A occurs. Then:


P(A) = P(D_1) + ... + P(D_k). (1.2)


The justification of Probability Fact 1.2 is based on thinking about
frequencies. The details are given in Section 1.8.


An important special case of Probability Fact 1.2 is the situation in which
D_1, ..., D_k cover all possibilities, meaning that one of these events always
occurs. In that case we have:


P(D_1) + ... + P(D_k) = 1. (1.3)


Why does equation (1.3) follow from equation (1.2)? Well, since D_1, ..., D_k
cover all possibilities, A always happens! (It’s a really boring event.) Just
as in Example 1.5, we then conclude immediately that P(A) = 1, so equation
(1.2) turns into equation (1.3).


Example 1.6 (Again with the boxes!). Return again to the problem 
of choosing one of two boxes (Example 1.1). In Section 1.5 a reason for
choosing Box 2 was stated, based on frequencies. We talked there about 
repeated experiments, but we didn’t actually perform the experiments, we 
just thought about them. One of the benefits of probability theory is that it 
saves a lot of work to reason theoretically, if we can do so carefully. 

Probability Fact 1.2 lets us easily analyze the probabilities in this
situation, as follows.


Figure 1.2: Box 1 and Box 2 again 


First, let’s mentally attach ID numbers to each of the objects in the two 
boxes. If you like, imagine painting little numbers on each object in the 
box. The ID numbers are invisible to observers, and they won’t change the 
behavior of the experiment at all, but this will make it easier to refer to 
different objects. See Figure 1.2.

Suppose we decide to choose an object from Box 2. Let D_i be the event
that object i is selected.

Our first step is to find P(D_i).

Since no object is favored, all the probabilities P(D_1), ..., P(D_6) are
equal.

By equation (1.3), P(D_1) + ... + P(D_6) = 1, so


P(D_i) = 1/6 for i = 1, ..., 6.


Let A be the event that the selected object is a diamond. We want to 
know the value of P(A). 

Looking at the labelled objects, we see that the diamonds are objects 
1, 2, 4, 5, 6. Hence A is the event that one of the events D_1, D_2, D_4, D_5, D_6
occurs.


By equation (1.2),


P(A) = P(D_1) + P(D_2) + P(D_4) + P(D_5) + P(D_6) = 5/6. (1.4)


In Section 1.5 we asserted that choosing from Box 2 should give success
approximately 5/6 of the time. By the frequency interpretation, this is
what equation (1.4) tells us, so we have a justification based on Probability
Fact 1.2.

Does this argument really need to be expressed in terms of probabilities? 
Couldn’t we just talk from the beginning about frequencies for the events 
D_i? Sure, but arguing carefully about frequencies takes more words, and so
risks imprecision. Using probabilities seems more pleasant. 
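
The arithmetic in equations (1.2) to (1.4) can be checked exactly with a few lines of Python. This is only a sketch: the variable names are invented here, and the labelling follows the statement in the text that objects 1, 2, 4, 5, 6 are the diamonds (so object 3 is taken to be the stone).

```python
from fractions import Fraction

# Labelling as described in the text: objects 1, 2, 4, 5, 6 in Box 2 are
# diamonds, so object 3 is assumed to be the stone.
OBJECT_IDS = [1, 2, 3, 4, 5, 6]
DIAMOND_IDS = [1, 2, 4, 5, 6]

# Equation (1.3) with six equal probabilities forces P(D_i) = 1/6.
p_d = {i: Fraction(1, len(OBJECT_IDS)) for i in OBJECT_IDS}
assert sum(p_d.values()) == 1

# Equation (1.2): add the probabilities of the mutually exclusive cases.
p_a = sum(p_d[i] for i in DIAMOND_IDS)
print(p_a)  # prints 5/6
```

Using `Fraction` rather than floating-point numbers keeps the sums exact, which matches the way the argument is carried out by hand.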


Notice that in Probability Fact 1.2 itself there is no assumption that all
the events D_i have the same probability. That assumption happens to be true
in Example 1.6, and it’s useful. But Probability Fact 1.2 is true whenever
the events D_1, ..., D_k are mutually exclusive.


Remark 1.7 (Equal probabilities?). The discussion in Example 1.6 seems
convincing, but let’s look harder at one step. A claim was made: 


“Since no object is favored, all the probabilities P(D_1), ..., P(D_6)
are equal.” 


Our practical experience supports this claim, and it seems natural, but it 
should be noted that here we are building in an extra assumption about the 
world. It is related to the comment made at the end of Section 1.4. We
said there that the world is not completely chaotic: if we toss a coin many 
times, and then perform a second sequence of tosses with the same coin, 
we expect that the frequency of heads in the second series of tosses will be 
roughly consistent with the frequency of heads in the first series. 

In Example 1.6 there are six different objects in Box 2, but they are
the same in any way which affects the results of the experiment. For that 
reason, we expect that in a long series of trials, each object will be selected 
with roughly the same frequency. In the language of probability, we think 
that each of the six objects has the same probability of being selected. If this 
assumption turns out to be false, we will conclude that we did not understand 
the experiment. 


1.8 Justifying the additive property of probability


Once we have Probability Fact 1.2 in our toolbox, we don’t need to keep
talking about frequencies, at least not so much. But we will use the frequency
interpretation of probability in this section to justify Probability Fact 1.2
once and for all.

Suppose that the experiment is repeated many times, say N times. For 
each repetition of the experiment, record whether or not event D_i occurred.
Let M_i be the number of times that event D_i occurred. By definition, the
frequency f_i with which D_i occurred is given by


f_i = M_i/N, (1.5)


and by the frequency interpretation, if N is large it is very likely that

f_i ≈ P(D_i), (1.6)


where we use the symbol “≈” to indicate approximate equality.

Let M be the number of times that the event A occurred. 

By the definition of A, to say that A occurred is to say that one of the 
events D_1, ..., D_k occurred.

We are assuming that the events D_1, ..., D_k are mutually exclusive.
Whenever one of these events occurs, the rest of the events do not occur. So each
occurrence of A contributes 1 to a particular M_i, and contributes nothing
to the others. Thus each occurrence of A contributes exactly 1 to the sum
M_1 + ... + M_k. Hence

M = M_1 + ... + M_k. (1.7)


This is not a probability statement. It is a counting statement, and our 
logic shows that it is precisely true. Dividing equation (1.7) by N shows that 
the frequency with which event A occurred is given by 


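f = M/N = (M_1 + ... + M_k)/N = f_1 + ... + f_k. (1.8)


By the frequency interpretation, if N is large it is very likely that

f ≈ P(A), (1.9)

and

f_i ≈ P(D_i). (1.10)


Combining (1.8), (1.9) and (1.10), we see that P(A) is approximately equal to
P(D_1) + ... + P(D_k), and the approximation can be made as good as we like
by taking N large. Since neither side depends on N, the two sides must be
equal. This gives equation (1.2).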
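
The counting step (1.7) can be watched in action with a short simulation. The Python sketch below is an illustration only: it reuses the Box 2 setting with the diamonds taken to be objects 1, 2, 4, 5, 6, it assumes each object is equally likely to be selected, and the names `DIAMOND_IDS` and `count_occurrences` are invented here. The counts M_i and M are computed separately from the recorded outcomes, so the equality M = M_1 + ... + M_k is checked rather than built in.

```python
import random

DIAMOND_IDS = (1, 2, 4, 5, 6)  # object 3 is taken to be the stone


def count_occurrences(n_repetitions, seed=None):
    """Repeat the Box 2 selection; return (M, {i: M_i}) for the diamond events."""
    rng = random.Random(seed)
    # Record the ID of the selected object on each repetition.
    outcomes = [rng.randint(1, 6) for _ in range(n_repetitions)]
    # M_i: number of repetitions on which D_i (object i selected) occurred.
    m_i = {i: outcomes.count(i) for i in DIAMOND_IDS}
    # M: number of repetitions on which A (a diamond selected) occurred.
    m = sum(1 for selected in outcomes if selected in DIAMOND_IDS)
    return m, m_i


if __name__ == "__main__":
    n = 6_000
    m, m_i = count_occurrences(n, seed=0)
    print(m == sum(m_i.values()))  # the counting statement (1.7)
    print(m / n, sum(count / n for count in m_i.values()))
```

The first printed value checks an exact counting identity, while the frequencies on the second line are only expected to be close to 5/6.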


Exercise 1.2. Just for fun: suppose that we are given two events D_1, D_2,
and we do not assume that they are mutually exclusive. Consider repeating
the experiment N times, and let M_i be the number of times that D_i occurs.

Let A be the event that at least one of the events D_1, D_2 occurs. Let M
be the number of occurrences of A.


(i) Show by an example that it need not be true that M = M_1 + M_2.


(ii) Can you state a condition that the three numbers M_1, M_2, M must
satisfy?
satisfy? 


1.9 Some simple examples 


Example 1.8 (One coin toss). For a coin toss there seem to be only two 
interesting events, the event H that the result is a head, and the event T
that the result is a tail. 

A coin is said to be fair if the probability of obtaining a head is equal to 
the probability of obtaining a tail. Gamblers are typically expected to use 
fair coins in their games. 

A real coin may be fair or unfair. For any coin, 


P(H)+P(T) =1. (1.11) 


Exercise 1.3. How is equation (1.11) related to Probability Fact 1.2?


Example 1.9 (Rolling Dice). Instead of thinking about tossing a coin,
let’s consider rolling a die. Most people have played games in which a die is 
rolled, or perhaps two dice are rolled. The die is a cube, so it has six faces. 
Rolling the die a single time is an experiment with six possible outcomes. 
The outcome of the experiment is the number of dots on the uppermost face 
of the die when it settles. The possible outcomes are 1, 2, 3, 4, 5, 6.

A die is said to be fair if all the outcomes 1,2,3,4,5,6 have the same 
probability. 

One possible event when rolling a die is the event that the outcome is 5. 
We might call this event A. The event A only occurs when the outcome is 5. 

Another possible event is the event that the outcome is an odd number. 
We might call this event B. B is described by a property that three of the 
possible outcomes have. If the die gives a 1, a 3, or a 5, we say that the event 
B occurred. 


Exercise 1.4. When rolling a fair die many times, what fraction of the rolls 
(approximately) will result in an odd number? 


Remark 1.10 (Comparing experiments). Rolling a fair die is physically 
different from the experiment of selecting an object from a box containing 
six possible choices, as described in Section 1.5 and Example 1.6. However,
in both cases there are six basic events, everything can be described in terms 
of those events, and the probabilities of the basic events are equal to 1/6 
in both cases. Thus one can translate any problem dealing with one of 
these experiments into a similar problem dealing with the other, and the 
corresponding numerical answers must agree. 
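
To make the translation concrete, here is a short Python sketch. It is an illustration only: the helper `probability` and the chosen event (“the result is at least 5”) are invented for this sketch, and the computation assumes that all six basic events have the same probability.

```python
from fractions import Fraction


def probability(basic_outcomes, event):
    """P(event) when every basic outcome has the same probability.

    By equations (1.2) and (1.3), each of the n basic outcomes has
    probability 1/n, and the probability of the event is the sum over
    the basic outcomes for which the event occurs.
    """
    n = len(basic_outcomes)
    return sum(Fraction(1, n) for outcome in basic_outcomes if event(outcome))


die_faces = [1, 2, 3, 4, 5, 6]    # faces of a fair die
object_ids = [1, 2, 3, 4, 5, 6]   # ID numbers of the objects in a box

p_die = probability(die_faces, lambda face: face >= 5)
p_box = probability(object_ids, lambda obj: obj >= 5)
print(p_die, p_box)  # prints 1/3 1/3
```

The two calls differ only in how we name the outcomes, which is the point of the remark.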

This observation applies to the fair case. Unfair dice certainly exist, 
perhaps due to variations in the density of the material. On the other hand, 
there isn’t an obvious simple way to modify the experiment in Section 1.5
in order to have different probabilities for the six basic events defined in
Example 1.6.


Exercise 1.5 (Lottery tickets). This book does not advocate buying lottery
tickets. But we can think about them without making a purchase. Suppose
that a company offers n lottery tickets for sale, where n may be a large
integer. Exactly one of these tickets is the winning ticket, and the purchaser 
will receive a large sum of money. The remaining tickets are worthless, and 
of course we don’t know which ticket is the winner. You have purchased one 
ticket. Let W be the event that your ticket turns out to be the winner. 


(i) Let P(W) be the probability of W. Find P(W). 


Note that the experiment of Section 1.5, using Box 1, essentially solves
this problem for n = 6. 


Common sense probability likely gives you the answer as well. 


(ii) A certain wealthy gambler buys k lottery tickets, where k may be any
number less than or equal to n. Let G be the event that the gambler
wins the lottery with one of the purchased tickets. Find P(G).


Remark 1.11. Let W and G be the events described in the lottery of
Exercise 1.5. Suppose that n is equal to 10°. Is P(W) a physically meaningful
probability value? Think about deciding whether the price of the ticket is 
reasonable. P(W) is certainly relevant to that decision. 

We found P(W) theoretically, using Probability Fact 1.2. Suppose that
you wish to use the frequency interpretation to test the validity of the value 
calculated for P(W). In principle this can be done. However, the whole lot- 
tery is part of the experiment, and a ridiculously large number of repetitions 
of the lottery would be required to accurately measure the frequency with 
which W occurs. 

On the other hand, when k is comparable in size to n, the value of P(G)
could be tested experimentally with fewer repetitions. This is indirectly a 
test of P(W). 


Example 1.12 (Sampling from a population). One can think about
making a random selection from a population as an experiment. Pollsters do 
this all the time, of course. 

It’s easier to think about a population of jelly beans than about a popula- 
tion of people, so suppose you have a large bowl containing many jelly beans, 
some yellow and some red. In this experiment we assume that there are n 
jelly beans altogether, k yellow ones and n - k red ones. In the experiment,
you randomly select exactly one bean, and record its color. 

Specifying the experiment includes specifying the actual number of beans 
of each color that are in the bowl. 

We prepare for the experiment by stirring the jelly beans vigorously, so 
that the beans in the bowl are thoroughly mixed. That is not the experiment, 
just part of the setup. 

Let C be the event that the selected bean is yellow. We would like to 
know P(C), that is, the probability that the selected bean is yellow.

Calculations in the setting of this experiment will be similar to calcula-
tions for the lottery described in Exercise [1.5]. This seems especially clear
if we think of the number of the winning lottery ticket as being randomly
chosen after the tickets have been sold. The event that your own ticket is
the winner corresponds to the event that a particular jelly bean is selected.
The set of tickets bought by the wealthy gambler in part (ii) of Exercise [1.5]
would correspond to the set of yellow jelly beans in the bowl.


1.10 Probability distributions 


One usually wants to know the probabilities for the possible outcomes of 
an experiment, and perhaps for some of the possible events. Here’s some 
standard terminology. 


Definition 1.13 (Probability distributions). A rule which assigns prob- 
abilities for some family of related events is called a probability distribution 
for the events. The probability which the rule prescribes for an event A is 
usually denoted by P(A).


A simple example of a probability distribution is a rule which gives the
probability of each possible outcome of an experiment. We might find such 
a distribution experimentally, as in the next section. 

The use of the word “distribution” in Definition [1.13] may reflect the fact 
that the probability values for all the outcomes must add up to one. In that 
respect, assigning probabilities to various possible events is a bit like splitting 
up a unit quantity of material and distributing it to various locations. 

The phrase “family of related events” in Definition [1.13] is not precise. It 
might refer to all events, or to some limited collection of events which are 
of interest at the moment. We will see examples of distributions in specific 
settings later. 


1.11 Collecting statistics 


It’s more fun to talk about frequencies than to actually perform experiments 
and measure them. But perhaps we should take a moment to look at some 
examples. 


Statistical data 


We will refer to experimental data which is systematically recorded and tab- 
ulated as statistical data (and see [9] for a discussion of correct grammatical 
usage of the word “data”!). 

General features of such data are referred to as statistical properties. If 
our data is the result of a sequence of repeated experiments, one statistical 
property is the frequency of a particular event. Of course one can calculate 
many other statistical properties in this setting, such as the frequency of 
obtaining the same outcome twice in a row, or the degree of variation in the 
data, etc. But at present we will just focus on the frequency. 


Collecting data to learn probabilities 


In the case of the experiment of rolling a die, a probability distribution gives
the probability of each possible result. For a fair die, the probabilities for
the values 1, 2, 3, 4, 5, 6 are 1/6, 1/6, 1/6, 1/6, 1/6, 1/6, respectively, but of
course a die may be unfair.

Suppose we have a die, but we don’t know the probability distribution
associated with this die. In this situation, one might roll the die repeatedly 
and use the frequency interpretation to get an idea about the distribution. 

Imagine rolling the die 20 times, recording the results. (To save time, we 
can use a computer to simulate rolling the die. This means that a computer 
program produces numbers that have similar statistical properties to the 
results of performing the actual repeated experiments. It’s not obvious that 
this can be made to work, but it does work, pretty well.) 

For a particular sequence of 20 trials, the outcomes happen to be 


D555 69 2ADT 1A SD 4A OG 


You can check that the counts for the outcomes 1, 2, 3, 4, 5, 6 are 3, 7, 1, 4, 3, 2.
Thus the frequencies for outcomes 1, 2, 3, 4, 5, 6 are 0.15, 0.35, 0.05, 0.2, 0.15, 0.1,
respectively. See Figure [1.3a].

These numbers are not probabilities, of course. They are just numbers 
that tell us something about the recorded outcomes for a particular experi- 
ment. But if we think about making additional rolls of the same die, we can 
hope that these numbers give us some idea of the probability of each possible 
outcome. 

That hope is based on the frequency interpretation of probability, which 
says that the probability of obtaining a particular value on one roll of the die 
should be similar to the observed frequency for that value, when we have a 
long sequence of repeated trials. 

However, a sequence of 20 trials does not seem long, especially when 
there are six possible outcomes. So it seems rash to draw a conclusion about 
probabilities based on these frequencies. 


Longer sequences of trials 


Let’s try to get a more accurate estimate for the probability of each possible
result. If we roll the die 100 times, recording the results, the frequencies for
1, 2, 3, 4, 5, 6 are 0.11, 0.33, 0.11, 0.19, 0.07, 0.19, respectively. See Figure [1.3b].

Even 100 repetitions is not very many. So let’s do more repetitions.
If we roll the die 1000 times, recording the results, the frequencies for
1, 2, 3, 4, 5, 6 are 0.09, 0.313, 0.09, 0.23, 0.097, 0.18, respectively. See Figure [1.3c].

This is fairly consistent with the results for 100 trials, but of course is
likely to be more reliable.

Let $p_i$ be the probability of obtaining the value i when rolling this par-
ticular die. If we have to start playing a gambling game using this particular
die, as a practical choice we might as well assume that


$$p_1 = 0.09,\quad p_2 = 0.313,\quad p_3 = 0.09,\quad p_4 = 0.23,\quad p_5 = 0.097,\quad p_6 = 0.18.$$


If you happen to know that this example was made up by a person who
likes simple numbers, then you may suspect that the actual probabilities for
outcomes 1, 2, 3, 4, 5, 6 are 0.1, 0.3, 0.1, 0.2, 0.1, 0.2, respectively. However, in real
world situations we should not expect such convenient values for the proba-
bilities.

Let’s do a sequence of 10000 trials, to check for consistency. This time we
find that the frequencies for 1, 2, 3, 4, 5, 6 are 0.0989, 0.2984, 0.0987, 0.2007,
0.1053, 0.198, respectively. (See Figure [1.3d].)

Now we feel reasonably confident that we have a good approximation for 
the probability distribution for this die. 
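
If you would like to repeat this kind of experiment yourself, the following Python sketch shows one way it might be done. It is not the program used to produce the numbers above, so its output will differ in detail. The weights 0.1, 0.3, 0.1, 0.2, 0.1, 0.2 are the "simple" values suspected above, taken here as an assumption about the die, and the names `FACES`, `WEIGHTS` and `observed_frequencies` are made up for this illustration.

```python
import random
from collections import Counter

FACES = [1, 2, 3, 4, 5, 6]
# Assumed distribution of the unfair die (the simple values that the
# text suggests one might suspect); not known to be the true one.
WEIGHTS = [0.1, 0.3, 0.1, 0.2, 0.1, 0.2]

def observed_frequencies(num_rolls, seed=0):
    """Simulate num_rolls rolls and return the frequency of each face."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    rolls = rng.choices(FACES, weights=WEIGHTS, k=num_rolls)
    counts = Counter(rolls)    # faces that never appear get count 0
    return {face: counts[face] / num_rolls for face in FACES}

if __name__ == "__main__":
    for num_rolls in (20, 100, 1000, 10000):
        freqs = observed_frequencies(num_rolls, seed=num_rolls)
        print(num_rolls, [round(freqs[face], 4) for face in FACES])
```

As in the text, the frequencies for 20 rolls can be far from the assumed weights, while the frequencies for 10000 rolls are usually close to them.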


Remark 1.14 (Messy data!). By now you will have noticed that when ran- 
domness is involved, recorded observations seem rather messy. If we display 
all the data in a plot, we are unlikely to obtain a nice neat picture.
This is in contrast to, for example, the beautiful curves we get when plot- 
ting solutions of differential equations. We can deal with randomness but we 
cannot eliminate it. 

With this in mind, it is striking that elegant patterns of behavior do 
emerge in data associated with large random systems. The Central Limit 
Theorem of probability shows this for a long series of coin tosses ([11]). It is
also one of the key insights of statistical physics. 


The data for die rolls was obtained by simulation using a computer. We 
won't take time to discuss how a computer actually carries out such simula- 
tions. The next exercise asks you to consider a different kind of simulation. 


Exercise 1.6 (Simple simulations). Suppose you are thinking about some 
experiment with three possible outcomes, each of which is supposed to have 
probability 1/3. For convenience, let’s give the three outcomes labels: a, b,c. 

The physical apparatus for this experiment is complicated and expensive, 
so you won’t actually perform the experiment today. But you would like to
play with some statistical data corresponding to these probabilities. You can
try to simulate this experiment using different equipment. That is, instead of 
actually doing the experiment, you will do some other experiment (perhaps 
something that is easier to perform repeatedly), which will produce the same 
values, with the same statistical properties as the real experiment. 

What matters is that your simulation is supposed to produce one of the 
labels a,b,c, with equal probability for each label. You may not have the 
equipment you need, though. Here are some cases to consider. 


(i) Suppose you have in your possession a fair die. How can you perform 
the simulation? 


(ii) Suppose you don’t have the fair die, but you have a fair coin and an 
unfair coin, and the unfair coin is known to produce a head with prob- 
ability 1/3. How can you perform the simulation? 


1.12 Brownian motion 


Our examples have been rather simple, although the principles they illustrate 
apply to very complex situations. Randomness seems to exist everywhere, 
and is almost unavoidable. 

When discussing coin-tossing in Section [1.4], we suggested that most peo-
ple seem to have some kind of “shakiness” in their arms and hands, which
causes the result of the coin toss to be unpredictable. One might try to ex- 
press this in a more general way by saying that the small motions of their 
arms are unpredictable, and this unpredictability then leads to unpredictable 
results for coin tosses. But then one can ask, “why are the small arm motions 
unpredictable?”. This type of questioning can be continued. It seems to lead
us to consider more and more detailed pictures of physical processes, at smaller
and smaller scales. Randomness and unpredictability apparently exist at all 
known levels of description. 

This book has no ultimate explanation for randomness. However, to 
illustrate randomness on a small scale, and how its effects can spread, let’s 
briefly consider a famous example: Brownian motion.

It’s 1827, and biologist Robert Brown is peering into his microscope ([5],
[7]). He sees little particles moving around in fluid, in a very irregular man-
ner. The original particles come from pollen grains, but as he continues
his observations he finds that all little particles in fluid seem to move in a 
rather similar way. They constantly change direction and do not seem to get 
“tired”. He finds that even water that has been trapped inside rocks can 
contain moving particles, and they must have kept moving during millions 
of years. Apparently the statistical properties of this particle motion do not 
change. 

What Brown saw is related to heat. Nowadays we interpret heat as dis- 
orderly motion of atoms and molecules. So let’s think about disorder. You 
can create disorder, for example by dropping something, so that the energy 
of its fall is transformed into heat when it hits the ground. You can move 
disorder around, for example when a hot object is placed in contact with 
a cold object. But it is very difficult to make a large disorderly collection 
become more orderly. 

The universe seems to be full of disorder, especially at small scales. The 
particles that Brown observed were large compared to molecules. That’s why 
he could see them. But physicists think that the motion of Brown’s particles 
is caused by collisions with molecules of the fluid which contains the particles. 

These “invisible” molecules in the fluid are moving in a disorderly way. 
We can’t predict the details of the movement of the molecules, and we think 
of their movement as random behavior. 

It’s interesting to consider how the fluid molecules interact with a particle 
that Brown observes. Any such particle will receive many impacts per second 
on all sides, from the tiny molecules. At normal temperatures the particle is 
going to be hit a lot. 

The effect of the collisions on the particle is roughly the same in all 
directions, because of the disorderly motion of the molecules. However, the 
number of impacts on each side naturally fluctuates, so that briefly one side 
of the particle receives more collisions than the other. 

We shouldn’t be surprised that there are fluctuations. Fluctuations are 
part of random behavior. If you think of tossing a fair coin many times, there 
will inevitably be periods when more heads than tails occur, just by chance. 
It all evens out in the long run, of course. 

But random fluctuations are what cause the particle movement that 
Brown observed. When more molecules hit a particle on one side than the 
other, it will move. Since the resulting particle motion is large enough to be
observable in a microscope, those tiny invisible molecules must have a lot of
energy. 

By our standards molecules move rather violently! If the molecules of 
your body somehow became orderly, and all moved in a single direction, 
your body would hurtle away at a speed of hundreds of meters per second. 

Brownian motion provides us with a vivid picture of disorder. It also 
gives us an example of how random behavior on a small scale is pervasive, 
and can lead to random behavior on a larger scale. 


1.13 Solutions for Chapter 1


Solution (Exercise [1.1]). Yep!
You knew that, didn’t you? The success frequency using the new version
of Box 1 will be even worse than before (1/10 rather than 1/6).


Solution (Exercise [1.2]). (i) The question does not require us to assume
that $D_1$ and $D_2$ are distinct events. If $D_2 = D_1$, then $M = M_1 = M_2$, so $M$
is not equal to $M_1 + M_2$ unless $M_1 = 0$.

If that example seems artificial, let $D_1$ be the event that it rains on
Monday, and let $D_2$ be the event that it rains on Tuesday. Suppose that over
a ten-week period, the following data is collected.


Week | Monday | Tuesday
1    | rain   | dry
2    | dry    | dry
3    | dry    | dry
4    | rain   | rain
5    | dry    | rain
6    | rain   | dry
7    | dry    | dry
8    | rain   | dry
9    | dry    | dry
10   | dry    | dry


From the table, $M_1 = 4$, $M_2 = 2$ and $M = 5$.

The problem is Week 4, isn’t it? This week is counted both in $M_1$ and
$M_2$. One often calls this situation “double-counting”.


(ii) In general, whatever the experiment, any repetition of the experiment
which contributes to $M$ must contribute to at least one of the quantities $M_1$
and $M_2$. Thus we must always have the relation


$$M \le M_1 + M_2. \qquad (1.12)$$
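
If you like to check such counts by machine, the rain table can be tallied with a few lines of Python. This is only a sketch: the dictionary `weeks` is a transcription of the table in part (i), and the variable names `m1`, `m2`, `m` and `both` are chosen here for illustration.

```python
# Rain data from the table: week -> (rain on Monday?, rain on Tuesday?)
weeks = {
    1: (True, False), 2: (False, False), 3: (False, False),
    4: (True, True), 5: (False, True), 6: (True, False),
    7: (False, False), 8: (True, False), 9: (False, False),
    10: (False, False),
}

m1 = sum(1 for mon, tue in weeks.values() if mon)             # weeks in which D_1 occurs
m2 = sum(1 for mon, tue in weeks.values() if tue)             # weeks in which D_2 occurs
m = sum(1 for mon, tue in weeks.values() if mon or tue)       # weeks with rain on either day
both = sum(1 for mon, tue in weeks.values() if mon and tue)   # double-counted weeks

print(m1, m2, m, both)  # 4 2 5 1
```

The one week counted in `both` is exactly the gap between `m` and `m1 + m2`, which is why the relation between them is an inequality rather than an equality.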


Solution (Exercise [1.3]). Either a head or a tail must be obtained. Hence
events H, T cover all possibilities. These events are mutually exclusive, since
the coin cannot come up both heads and tails!

By equation [1.3] with $D_1 = H$ and $D_2 = T$, $P(T) + P(H) = 1$.


This is equation (1.11). 


Solution (Exercise [1.4]). Let $D_i$ be the event that outcome i occurs, for
i = 1,...,6. By equation (1.3),

$$P(D_1) + \dots + P(D_6) = 1.$$

For a fair die, $P(D_1) = P(D_2) = \dots = P(D_6)$, and so we have

$$6 P(D_1) = 1.$$

Thus $P(D_1) = 1/6$, and so $P(D_i) = 1/6$ for each i = 1,...,6. (Yup, we used
the same argument in Example [1.6].)

Let B be the event that an odd number is obtained. Clearly

$$B = D_1 \cup D_3 \cup D_5.$$

By equation (1.2),

$$P(B) = P(D_1) + P(D_3) + P(D_5) = \frac{3}{6} = \frac{1}{2}.$$


Using the Frequency Interpretation, we expect that in a large number of rolls, 
approximately 1/2 of the rolls will result in an odd number. 
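
The same arithmetic can be checked with exact fractions. The short Python sketch below is only an illustration; the names `p`, `B` and `prob_B` are chosen here, and the fair-die values 1/6 are the ones derived in this solution.

```python
from fractions import Fraction

# Fair die: each of the six outcomes gets probability 1/6.
p = {i: Fraction(1, 6) for i in range(1, 7)}

# B is the event "an odd number is obtained".
B = {1, 3, 5}

# Additivity: add the probabilities of the mutually exclusive outcomes in B.
prob_B = sum(p[i] for i in B)

print(sum(p.values()))  # 1
print(prob_B)           # 1/2
```

Using `Fraction` rather than floating-point numbers keeps the sums exact, so the totals come out as 1 and 1/2 with no rounding.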


Solution (Exercise [1.5]). Suppose that each ticket has an identification
number.
Let $D_j$ be the event that ticket j is the winning ticket.

The events $D_j$, j = 1,...,n are clearly mutually exclusive and cover all
possibilities.

As far as we know, no ticket is favored, and we will calculate probabilities
based on that. Since no ticket is favored, $P(D_j)$ is the same for every j.

By equation [1.3], $P(D_1) + \dots + P(D_n) = 1$.

Hence

$$P(D_j) = \frac{1}{n}$$

for every j.


(i) Suppose you purchased ticket t. W is the event that ticket t is the
winning ticket. Thus $W = D_t$, and so $P(W) = 1/n$.


(ii) We can always number the tickets so that tickets 1,...,k are the ones 
that the wealthy gambler purchases. This is just to make it easier to write 
down the argument. Then 


$$G = D_1 \cup \dots \cup D_k,$$


and so by equation (1.2) we know that


$$P(G) = P(D_1) + \dots + P(D_k) = \frac{k}{n}.$$


Solution (Exercise [1.6]). (i) For each roll of the fair die, report the result
as label a if the die gave a 1 or a 2, report label b if the die gave a 3 or a 4,
and report label c if the die gave a 5 or a 6.

The die will give 1 on approximately 1/6 of the tosses and the die will 
give a 2 on approximately 1/6 of the tosses. Hence label a will be reported 
approximately 1/3 of the time, which is what is desired. Similarly labels b 
and c will each be reported 1/3 of the time. 


(ii) Toss the unfair coin. If the coin gives a head, report label a. Otherwise, 
continue the simulation by tossing the fair coin. If the fair coin gives a head, 
report label b. If the fair coin gives a tail, report label c. 

Clearly label a will be reported on approximately 1/3 of the times you 
perform the simulation. You will report label b during approximately 1/2 of 
the times that you don’t report a. Since 1/2 of 2/3 is 1/3, this is what is 
desired. Similarly label c will be reported approximately 1/3 of the times. 
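
Both recipes can be tried out on a computer. In the Python sketch below, the die and the two coins are themselves imitated by a pseudo-random number generator, which is an assumption of this illustration rather than part of the exercise. The function names `label_from_die`, `label_from_coins` and `frequencies` are made up here.

```python
import random
from collections import Counter

def label_from_die(rng):
    """Part (i): one roll of a fair die, mapped to a label."""
    roll = rng.randint(1, 6)  # randint includes both endpoints
    if roll <= 2:
        return "a"
    if roll <= 4:
        return "b"
    return "c"

def label_from_coins(rng):
    """Part (ii): unfair coin (head with probability 1/3), then a fair coin."""
    if rng.random() < 1 / 3:   # the unfair coin gives a head
        return "a"
    if rng.random() < 1 / 2:   # the fair coin gives a head
        return "b"
    return "c"

def frequencies(method, trials, seed=0):
    """Run the chosen simulation `trials` times and return label frequencies."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    counts = Counter(method(rng) for _ in range(trials))
    return {label: counts[label] / trials for label in "abc"}

if __name__ == "__main__":
    print(frequencies(label_from_die, 30000))
    print(frequencies(label_from_coins, 30000))
```

For a large number of trials, all three frequencies from either method should come out near 1/3, in line with the argument above.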


Chapter 2


Assumptions for probability, 
and their consequences 


In this chapter we lay out the general structure of mathematical probability. 
General statements are necessarily abstract, but the abstractions of proba- 
bility theory are fairly pleasant. Examples are given in Section [2.4] and visits 
to that section are welcome at any time. 


2.1 Abstract outcomes 


Often we want to know whether or not the result of a given experiment has 
a certain property that we are interested in. The occurrence of this prop- 
erty is what we call an “event”. The complete result of the experiment is 
called the “outcome”. There may be many possible outcomes for a given 
experiment, some of which have the property we are interested in. Calculat- 
ing the probability of an event typically requires us to consider all possible 
outcomes. With that in mind, let’s think about representing outcomes in a 
mathematical model. 

In a calculation, we necessarily restrict our attention to abstract rep- 
resentations of outcomes. These mathematical representations of physical 
outcomes will also be called “outcomes”, or perhaps “abstract outcomes” if 
we want to emphasize that these are objects of thought. 

Each abstract outcome is a mathematical object, from which all inessen-
tial properties have been ruthlessly stripped. Thus if a botanist is experi-
menting in breeding roses, a beautiful new plant in the real world might be
represented abstractly by a single letter which indicates its color. In general,
the representation must include whatever properties of the outcome we
are interested in, but need not have more details.

If we base our calculations on the possible outcomes, then the events that 
we are interested in must be represented in terms of the outcomes. When a 
physical event is defined by a certain property, we will represent the event as 
the set of outcomes which have this property. 

Is that an adequate way to represent an event? As a set? The next 
example tests this approach. 


Example 2.1 (Brown hair as a set of outcomes). Consider an exper- 
iment in which one person is randomly selected from a population. The 
person selected is the physical outcome of the experiment! 

Since the outcome is a person, the outcome has a lot of properties. One 
of the properties of the outcome is hair color. Let A be the event that the 
selected person has brown hair. 

Notice we have defined A physically in terms of a property of the outcome. 
Now suppose we wish to represent this event in an abstract model. 

We can give each person in the population an identification code. An 
abstract outcome would be the ID of the randomly selected person. The 
abstract version of A would be the set of all IDs of people that have brown 
hair. 

The question is whether this representation of A is sufficient for our needs. 

Suppose that no one told you what property defines A, but instead showed 
you the entire set of people who have that property, would you be able to 
guess what the property was? 

The entire set of people with the property defining A consists of exactly 
those members of the population who have brown hair. If you became aware 
of that fact, you might guess that hair color was the property that defines 
A. On the other hand, it is conceivable that some other property might 
occur in exactly the same set of people. So we must admit that knowing 
the set of outcomes does not really tell you what physical property is under 
consideration. 

However, since you know the abstract representation of A as a set of 
outcomes, then, if a particular outcome occurs as a result of the experiment, 
you can tell whether or not event A occurred: just check whether the outcome 
is in the set which represents A. And that sort of information should be 
enough for a probability calculation.


We should keep in mind that mathematical terminology steals words from
ordinary language. If a mathematical term is well chosen, then its meaning 
in ordinary language will suggest its mathematical meaning, but one cannot 
simply rely on ordinary language to guess the exact mathematical definition. 
The word “event” in ordinary language usually suggests that something in- 
teresting has happened. In mathematical probability theory an event is just 
a set of outcomes. 
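
The idea that an event is just a set of outcomes translates directly into code. Here is a small Python sketch in the spirit of Example 2.1. The population, its ID codes and the hair colors are invented for illustration, and the helper name `occurred` is an assumption of this sketch.

```python
# Hypothetical population: ID code -> hair color (made-up data).
population = {
    101: "brown", 102: "black", 103: "brown",
    104: "red", 105: "blond", 106: "brown",
}

# The abstract outcomes are the ID codes.
sample_space = set(population)

# The event A, "the selected person has brown hair", as a set of outcomes.
A = {person_id for person_id, color in population.items() if color == "brown"}

def occurred(event, outcome):
    """An event occurs exactly when the observed outcome belongs to it."""
    return outcome in event

print(sorted(A))         # [101, 103, 106]
print(occurred(A, 103))  # True
print(occurred(A, 104))  # False
```

Notice that once the set `A` has been built, the function `occurred` never looks at hair color again. Membership in the set is all that is needed to decide whether the event happened.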


The following is some standard terminology for situations where we want 
to systematically represent all possible outcomes for an experiment. 


Definition 2.2 (Sample space models). The set of all possible abstract
outcomes is called the sample space, often denoted by the uppercase Greek
letter Ω (“Omega”). The abstract outcomes are the “points” making up the
sample space, and they are traditionally called “sample points”. A sample
point is often denoted by the lowercase Greek letter ω (“omega”).

The sample space is said to be a “model” for an experiment when its 
sample points can be interpreted as the possible outcomes of that experiment. 

Certain subsets of the sample space will be referred to as “events”, al- 
though of course they are mathematical objects rather than physical events. 
When we use the sample space as a model for an experiment, these subsets 
provide mathematical representations for actual physical events. 

Any one-point set {ω} is an event in the model. It represents the event
that the result of the experiment is the outcome represented by ω.

Since ω represents a possible result of an experiment, it would not be
unreasonable to also say that ω itself is an event. However, since we are rep-
resenting general events as sets of sample points, probably it’s less confusing
to stick to that, and use {ω} rather than ω when we are talking about events.


It should be emphasized that a sample space is a mental concept. It 
represents something about the real world, but only indirectly. Even a very 
large sample space has no weight! 

For a sample space which consists of a finite number of outcomes, every 
nonempty subset of the sample space can be interpreted as representing a
physical event, although not necessarily an interesting one. In more compli-
cated sample spaces, we don’t try to give an interpretation for every possible
subset, so not every subset is called an event.

We will find later that the general properties of probability are often 
sufficient to solve a problem efficiently without committing our thoughts to 
any explicit choice of a sample space. On the other hand, if we cannot think of 
any sample space at all to represent the outcomes of a proposed experiment, 
it might be a good idea to investigate whether the experiment makes sense. 


Here are some standard examples of experiments and corresponding sam- 
ple spaces. 


Example 2.3 (Tossing a coin once). The points of the sample space Ω
should represent exactly two physical outcomes, the occurrence of a head,
and the occurrence of a tail. So we can take Ω = {1,0}, a set with only
two points. There are four events in the sample space Ω: {1,0}, {1}, {0}, ∅.
The event {1,0} in the sample space describes a physical event which always
happens. The empty set ∅ contains no sample points. It is a subset of the
sample space, so by definition it is a mathematical event. But there is no
physical outcome which would correspond to this event, so we will say that
it is an impossible event.

Like all sample spaces, the set Ω is a thought in our heads, not something
in the real world. We could use letters rather than numbers to represent
sample points, so that “h” would mean a head was obtained and “t” would
mean a tail. Then we would have Ω = {“h”, “t”}. What matters is the
interpretation. The interpretation associates one sample point with the result 
in which the coin toss gives a head, and the other sample point with the result 
in which the coin toss gives a tail. 

For brevity, sometimes we'll refer to getting a head as “success”, and 
getting a tail as “failure”. Of course, the name doesn’t matter, and we could 
switch, and call getting a tail “success”, if we felt like it. 


Example 2.4 (Rolling a die once). Much as in the case of a coin toss,
we can take Ω = {1,2,3,4,5,6}, so that the outcome is simply the number
obtained on the roll of the die.

There are 64 possible subsets of this sample space, and each subset is an
event in the sample space which represents a physical event. For instance,
the event {2,4,6} in the sample space represents the physical event that an
even number was obtained.


In general, for any property that you can express about the outcome of
an experiment, there is a corresponding set of points in the sample space which
represents the same statement.


Incidentally, we claimed that there are 64 possible events in the sample
space for one roll of a die. Did that number make sense?
In general, it is useful to know the following fact.


Lemma 2.5 (Number of subsets). Any set of size k has exactly $2^k$ subsets.
(The empty set is one of the $2^k$ sets.)


Proof. We can build a subset by making a decision for each object: “include”
or “don’t include”. Thus we build a subset by making k decisions, each of
which has two choices. This gives

$$\underbrace{2 \times \dots \times 2}_{k \text{ factors}} = 2^k$$

ways to build the subset.


We can apply Lemma [2.5] to the sample space for rolling a die. In that
case k = 6, and $2^k = 64$. Each of the 64 subsets is the mathematical
representation of a possible event.
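
For a small sample space the subsets can simply be listed by machine, which gives a direct check of the count $2^6 = 64$ for one roll of a die. The Python sketch below is an illustration only; the function name `all_subsets` is made up here.

```python
from itertools import combinations

def all_subsets(points):
    """Return every subset of `points`, including the empty set."""
    points = list(points)
    subsets = []
    for size in range(len(points) + 1):
        # combinations yields each subset of the given size exactly once
        for combo in combinations(points, size):
            subsets.append(frozenset(combo))
    return subsets

die_events = all_subsets(range(1, 7))
print(len(die_events))                     # 64
print(frozenset({2, 4, 6}) in die_events)  # True
print(len(all_subsets(["h", "t"])))        # 4
```

Listing subsets this way is only practical for very small sets, since the number of subsets doubles each time one more point is added.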


After considering tossing a coin once, we might consider tossing it n times, 
where n can be 1, 2,.... 


Example 2.6 (Tossing a coin a million times). An outcome in the
sample space for the experiment of tossing a coin one million times must
record the result of each toss! Our choice for a sample point is a sequence
$(x_1, \dots, x_{1000000})$, where each $x_i$ is either 1 or 0. We could use a similar
sample space for tossing a coin n times, for any n.

Tossing a coin one million times would not be practical for an individual,
but it would be perfectly feasible in an industrial setting. Notice however
that Ω contains $2^{10^6}$ sample points. (We did say that a sample space is
a mental concept, rather than a real object, didn’t we?) Every subset is an
event in the sample space, and so there are $2^{2^{10^6}}$ possible abstract events!
Each of these abstract events has a physical interpretation, though very very
few of the details of such events are significant.

It may seem absurd to consider such a large sample space. Nevertheless, 
since we are able to reason precisely in an abstract setting, we are able to 
reliably establish useful facts. 


Exercise 2.1. Consider the experiment of tossing a coin 400 times, and 
recording the result of all 400 tosses. This is not a very complicated experi- 
ment, and could easily be carried out by hand by one person. 

You can use a sample space for this experiment similar to that in Exam-
ple [2.6]. Let N be the number of sample points in this sample space.

According to google, the number 10* is likely an upper bound for the 
number of atoms in the observable universe. How does the number 10°? 
compare with N? 


Example 2.7 (Drawing a card). A standard deck of playing cards consists 
of 52 distinct cards. There are four types of cards. The types are called 
“suits”, and every card belongs to exactly one suit. Each suit has 13 cards, 
and the names of the suits are “spades”, “hearts”, “diamonds” and “clubs”. 

If the deck is shuffled a few times, cards become arranged in a fairly ran- 
dom order. Drawing the top card from the deck is equivalent to selecting one 
member of a population of 52 (with no member of the population favored). 
What would be a reasonable model for this sampling experiment? 

We could certainly number the cards, in an arbitrary manner. A number
in {1,...,52} is then an abstract representation for a card, and we could build
our model using these abstract “cards”. Let’s agree to call each number in
{1,...,52} an abstract outcome of the experiment of drawing a card.


Suppose that we are interested in the physical event A that a “heart” card 
is drawn. Since our abstract model contains an abstract outcome (a number 
label) representing each possible outcome, we can represent the event A as 
the set of abstract outcomes that represent “heart” cards. Thus A contains 
13 sample points.


We have already discussed experiments involving sampling from a bowl
of jelly beans (Example [1.12]), or from the population of a country. The
reader will have no difficulty constructing appropriate sample spaces for these 
experiments, when needed. 


2.2 Distributions and set-functions 


Mathematical probability theory tells us how to reliably calculate new prob- 
abilities from given probabilities. Mathematics doesn’t tell us how to get the 
probabilities that we start with. To quote the physicist E. T. Jaynes, “No
matter how profound your mathematics is, if you hope to come out even-
tually with a probability distribution, then at some point you have to put
in a probability distribution” ([6]). The probabilities we start with must
somehow come from the physical description of an experiment. 


Probability Assumption 2.1 (Existence of a distribution). When we 
work with a model, and represent events as subsets of a sample space, it 
is assumed that there is a mathematical probability P(A) for each event, 
although we may not know the value of every probability. 

This probability P(A) is of course a function of A. Since the domain of 
P is made up of sets, one often speaks of P as a probability set-function. 


Definition 2.8 (Probability models and probability terminology). A 
sample space, together with a given probability set-function, will be called a 
probability model. 

Any rule which specifies probabilities can be called a distribution (Def-
inition [1.13]). So a probability set-function can also be called a probability
distribution, and we frequently use that terminology.


Of course we often start analyzing a problem by thinking directly about 
probabilities for physical events connected with a particular experiment. 
There need not be any sample space chosen at that stage, so the proba- 
bility values P(A) are associated with the actual physical events A, or rather 
with our mental conceptions of them. In this case we would not think of P 
as a set-function, but one can still refer to the family of values P(A) as a
distribution. 


Let’s pause for a moment to compare what we are doing here with our 
discussions in Chapter 1. In that chapter we talked about probability facts,
namely the frequency interpretation and additivity. But in Probability As- 
sumptions we are apparently starting to make assumptions. What hap- 
pened? Did we lose the courage of our convictions? 

Here’s what’s going on. In Chapter 1 we were talking about physical
probability, for real-world situations. Now we are talking about abstract 
models, things which we can reason about mathematically. Our abstract 
models are indeed relevant to the real world, but only if we build in the correct 
mathematical assumptions. It is those assumptions that we are talking about 
here. 


Remark 2.9 (Interpreting a model). A “model” in mathematics may 
represent an experiment, but Definition 2.8 doesn’t say much by itself about
the physical situation that a probability model represents. The connecting 
link between a probability model and the real world is the interpretation of 
the model, and the interpretation is not part of the mathematical definition. 
But we usually need to have at least a rough interpretation in mind to work 
successfully with a model. 


In this book we will study the general mathematical properties that prob- 
ability models have, and then apply those properties whenever we use a 
probability model to represent an experiment. 


In a valid interpretation of a probability model, the value of the probabil- 
ity for the abstract event A should be approximately equal to the probability 
of the physical event represented by A. 

Making sure that a model is valid is ultimately a physical problem rather 
than a mathematical one, although mathematics may help us to test the 
validity of a model. When we discuss the applications of a mathematical 
probability model in this book, we will confidently assume that our model is 
a valid one. In the real world such confidence can be misplaced. 


2.3 Events defined in terms of other events


Events in a mathematical model are represented by sets, and so relationships 
are often expressed using set language. Consequently, readers will need to 
know the basic terminology for set operations. This material is likely familiar, 
but Section 2.6 reviews all the concepts and notations which are needed. It’s
a good idea to look through that section, since notations and terminology 
for set operations can vary slightly. 

Here are some set concepts and notations which are often used. 

For sets $A_1, \dots, A_k$, the union of $A_1, \dots, A_k$ is the set consisting
of every element which is a member of at least one of the sets
$A_1, \dots, A_k$. This set is denoted by $A_1 \cup \dots \cup A_k$. When
$A_1, \dots, A_k$ are events, the union $A_1 \cup \dots \cup A_k$ represents the
event that at least one of the events $A_1, \dots, A_k$ occurred.

For sets $A_1, \dots, A_k$, the intersection of $A_1, \dots, A_k$ is the set
consisting of every element which is a member of all of the sets
$A_1, \dots, A_k$. This set is denoted by $A_1 \cap \dots \cap A_k$. When
$A_1, \dots, A_k$ are events, the intersection $A_1 \cap \dots \cap A_k$
represents the event that every one of the events $A_1, \dots, A_k$ occurred.

We often consider situations in which some events $A_1, \dots, A_k$ are mutually
exclusive, meaning that at most one of these events can occur. In that case
no sample point can be a member of more than one of the sets $A_1, \dots, A_k$,
and we say that these sets are disjoint.

For any sets $A$, $B$, the set difference $B - A$ is the set of all elements which
are members of $B$ but not $A$. And in situations where all sets are subsets of
some fixed set $U$, it is convenient to write $U - A$ as $A^c$. The set $A^c$ is
referred to as the complement of $A$.

We often denote the number of elements in a finite set $S$ by $|S|$. If a set is
not finite we say it is infinite, and say that $|S| = \infty$.

See Section 2.6 for more discussion of sets.
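
For readers who like to experiment on a computer, the set operations above can be tried out directly. The following is only a small sketch using Python's built-in sets; the sets `omega`, `even` and `at_least_five` are illustrative names chosen here (they do not appear elsewhere in the book), and `omega` plays the role of the fixed set U.

```python
# Illustrative sets for one roll of a die (names are assumptions of this sketch).
omega = {1, 2, 3, 4, 5, 6}            # the fixed set U containing all outcomes
even = {2, 4, 6}                      # "an even number is obtained"
at_least_five = {5, 6}                # "the number is at least five"

union = even | at_least_five          # at least one of the two events occurred
intersection = even & at_least_five   # both events occurred
difference = even - at_least_five     # members of `even` that are not in `at_least_five`
complement = omega - even             # complement of `even` relative to omega
disjoint = even.isdisjoint({1, 3})    # True when the two sets share no element
size = len(even)                      # number of elements, written |S| in the text
```

Note that the complement only makes sense once the surrounding set `omega` has been fixed, which matches the convention described above.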


Probability Assumption 2.2 (Set operations and sample space events).
If $A_1, \dots, A_k$ are events in a sample space, then so are
$A_1 \cup \dots \cup A_k$ and $A_1 \cap \dots \cap A_k$. If $A$ and $B$ are events
in the sample space, then so are $A - B$ and $A^c$.


To justify this assumption, recall that events in the sample space corre- 
spond to meaningful statements about the physical result of an experiment. 
If we think that statements $a_1, \dots, a_k$ are meaningful, then surely we
must also think that the statement “at least one of the statements
$a_1, \dots, a_k$ holds” is meaningful, and “all of the statements
$a_1, \dots, a_k$ hold” is meaningful.

It is also meaningful to say “$a_1$ is true and $a_2$ is not true”.

Translating such observations into set language gives us Probability
Assumption 2.2.

With set notation, we can rephrase Probability Fact as follows. 


Probability Assumption 2.3 (Additivity of probability). Let $D_1, \dots, D_k$
be disjoint events in some probability model. Then

$$P(D_1 \cup \dots \cup D_k) = P(D_1) + \dots + P(D_k). \tag{2.1}$$

Also, probabilities in the model are such that

$$P(\Omega) = 1. \tag{2.2}$$


If we think of a probability simply as a number that measures degree
of belief, we could scale all our probability values up or down by a factor,
without changing their usefulness. Equation (2.2) tells us that we are using
a belief scale for which certainty is 1. Of course this scale fits the statement 
of the frequency interpretation, so it is the natural scale for probability. 


Remark 2.10. If $D_1, \dots, D_k$ are disjoint events in some probability model,
and $D_1 \cup \dots \cup D_k = \Omega$, Probability Assumption 2.3 implies that

$$P(D_1) + \dots + P(D_k) = 1. \tag{2.3}$$

Thus Probability Assumption 2.3 includes the abstract version of equation (1.3).


Remark 2.11 (Set notations without sets!). If A and B represent phys- 
ical events, we may still use set notation to describe combinations of these 
events, even when we are not representing A and B as sets. For example, 
the event that A occurs and B occurs will still be expressed as $A \cap B$.

This convention can be justified in two ways. First, it is a convenient 
brief notation. Second, for any experiment, one could define some sample 
space model to represent the experiment. In that case the event that both
A and B occur would indeed be represented by the intersection of two sets 
in the sample space. 


Many examples in probability theory involve situations in which the sam- 
ple space is a finite set. In this situation, we can make some formulas a bit 
neater by introducing the following special notation for the probability of a 
one-point set. 


Definition 2.12 (Probability mass functions). For any probability model,
we can optionally write $P(\{\omega\})$ as $p(\omega)$, for brevity.

The function p is referred to as a probability mass function, or more
briefly as a probability function.


The word “mass” in the name “probability mass function” is intended to 
suggest a lump of probability attached to each sample point. The probability 
of an event can be pictured as the sum of the masses of all the sample points 
in the event. 
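
The “lumps of mass” picture translates directly into a small computation. The sketch below is an illustration only: the three-point sample space and its masses are made-up values, and `prob` is a helper name introduced here, not notation from the book. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# A hypothetical probability mass function on a three-point sample space.
# The sample points "a", "b", "c" and their masses are illustrative assumptions.
p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

def prob(event, pmf):
    """Return the probability of an event as the sum of the masses of its points."""
    return sum((pmf[w] for w in event), Fraction(0))

total_mass = prob(p.keys(), p)   # mass of the whole sample space
p_a_or_c = prob({"a", "c"}, p)   # mass of the event {a, c}
```

Here `total_mass` comes out as 1 and `p_a_or_c` as 2/3, which is just the sum 1/2 + 1/6 of the two lumps.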


2.4 Some basic examples 


In this section we’ll look at some simple examples, and use the properties 
of probability to find the probability of some simple events. Common sense 
reasoning would get us the same answers, perhaps more quickly, but it is 
important to see how the general assumptions of probability apply in simple 
cases. This will give us confidence in the theory. 


Example 2.13 (Probabilities for a single coin toss). There are only 
two possible outcomes. As in Example we can choose to represent these 
outcomes by 0 and 1. The outcome is 1 if a head is obtained, and the outcome 
is 0 if a tail is obtained. Thus the sample space $\Omega$ is given by $\Omega = \{0, 1\}$.

By equation (2.3), P({1}) + P({0}) = 1. Using the notation of
Definition 2.12, this says that p(1) + p(0) = 1.

If the probability of a head is p and the probability of a tail is q, then 
p + q = 1. For a fair coin, p = q = 1/2.


Example 2.14 (Probabilities for a single roll of a die). We take
$\Omega = \{1,2,3,4,5,6\}$, with the same interpretations as Example

If the die is fair, then the probability of each possible outcome is the
same, so P({1}) = P({2}) = P({3}) = P({4}) = P({5}) = P({6}), i.e.
p(1) = p(2) = p(3) = p(4) = p(5) = p(6).

By equation (2.3), p(1) + p(2) + p(3) + p(4) + p(5) + p(6) = 1. 

Thus in the fair case p(i) = 1/6 for each i.


Exercise 2.2. Suppose that $\Omega = \{1,2,3,4,5,6\}$, and assume that
$P(\{\omega\}) = 1/6$ for each $\omega \in \Omega$. Let A = {2,4,6}, so that A
represents the physical event
that an even number is obtained. Show from the definitions that P(A) = 1/2. 


Remark 2.15 (Symmetry in probability). In games, we generally try to 
use a fair coin. 

The true test of fairness is to toss the coin a large number of times, and 
see if we obtain approximately the same fraction of heads and tails. If we 
can’t do that, we can at least examine the coin carefully, to see if there is 
anything about the physical properties of the coin which would favor heads 
or tails. If we don’t find anything that favors one side of the coin, we would 
say that the coin is symmetric with respect to heads and tails. Since there 
is nothing that would lead us to assign a higher probability to one side over 
the other, it seems reasonable to assign equal probability to each of the two 
possible outcomes. 

Symmetry in probability calculations has been used for a long time, and 
in the old days it was sometimes described as “the Principle of Indifference”,
or “the Principle of Insufficient Reason”. This principle says we should assign 
equal probabilities to possible outcomes if we have no positive reason to do 
otherwise. We already used a somewhat similar approach in Example 

The use of symmetry is dangerous if it is based on ignorance. For ex- 
ample, suppose you decide to gamble with someone who is tossing a coin, 
and you know very little about the person and the coin. As a believer in 
the Principle of Insufficient Reason, you may feel you have no choice but to
assign a probability of 1/2 to the occurrence of a head. If your new friend 
obtains five tails in a row you may regret this probability assignment. 

Of course, in a real-life situation, you need not stick to your original 
assumptions, when new information starts to come in. Chapter 4 (on
conditional probability) deals with rules for updating probability assessments,
when you obtain additional information. 

A general comment: even if you make a careful examination of the setting 
of an experiment, you may overlook some factor. Then the setting of the 
experiment may be less symmetrical than you think. 


As Remark 2.15 notes, some probabilities are more reliable than others.
In this book we are mainly interested in drawing conclusions about models 
that we consider to be reliable. 


Before considering more examples, let’s formally state a rule that seems 
rather obvious. We include a proof to emphasize that this rule follows from 
the assumptions that we have already made: additivity, and the fact that 
one-point sets are events. 


Theorem 2.16 (Finite events). Let $\Omega$ and $P$ be a probability model. If
A is a finite set of sample points, then A is an event, and

$$P(A) = \sum_{\omega \in A} P(\{\omega\}) = \sum_{\omega \in A} p(\omega). \tag{2.4}$$


Proof. By the definition of union,

$$A = \bigcup_{\omega \in A} \{\omega\},$$

and the sets $\{\omega\}$ are obviously disjoint. Additivity then gives equation (2.4).
We already used this reasoning in Exercise 2.2.


Equations (2.2) and (2.4) of course tell us that when $\Omega$ is a finite set,

$$\sum_{\omega \in \Omega} p(\omega) = P(\Omega) = 1. \tag{2.5}$$


When setting up a probability model with a finite sample space, if we 
can decide on the value of $P(\{\omega\})$ for each sample point $\omega$, then (by
equation (2.4)) all other probabilities P(A) are determined. So a simple proba-
bility model is usually defined by listing the probabilities of the outcomes. 
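
To see concretely that listing the outcome probabilities determines everything else, here is a minimal sketch. The helper name `make_model`, the colour labels and the masses are all assumptions made for this illustration. The helper checks the condition of equation (2.5) and then returns a function that computes P(A) for any event A by equation (2.4).

```python
from fractions import Fraction
from itertools import combinations

def make_model(pmf):
    """Check that the listed masses sum to one, then return the set-function P."""
    if sum(pmf.values()) != 1:
        raise ValueError("outcome probabilities must sum to one, as in equation (2.5)")

    def P(event):
        # Equation (2.4): add up the masses of the sample points in the event.
        return sum((pmf[w] for w in event), Fraction(0))

    return P

# Hypothetical three-point model, defined only by its outcome probabilities.
pmf = {"red": Fraction(1, 2), "green": Fraction(3, 10), "blue": Fraction(1, 5)}
P = make_model(pmf)

# Every subset of the sample space is an event; list all 2**3 of them with P(A).
all_events = [frozenset(c) for r in range(len(pmf) + 1) for c in combinations(pmf, r)]
table = {event: P(event) for event in all_events}
```

The dictionary `table` then holds the probability of each of the eight events, from the empty set (probability 0) up to the whole sample space (probability 1).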


Example 2.17 (Probabilities for tossing a fair coin twice). Consider 
the experiment of tossing a coin twice. As in Example we think of 1 as 
representing a head, and zero as representing a tail. The result of each toss is 
represented that way, and there are two tosses, so we take every sample point 
to be an ordered pair of numbers, each of which is either one or zero. The 
first number represents the result of the first toss, and the second number 
represents the result of the second toss. 

There are two choices for the first number, and two for the second number, 
so there are four sample points, and $\Omega = \{(1,1), (1,0), (0,1), (0,0)\}$. Our
interpretation is that (1,1) represents obtaining two heads, (1,0) represents 
getting a head followed by a tail, (0,1) represents getting a tail and then a 
head, and (0,0) represents getting two tails. 

We need to find p((1,1)), p((1,0)), p((0,1)), p((0,0)).

It will be easy to find these probability values, once we introduce the
general concept of independence (Section 5.1). If you have used independence
in any previous study of probability, you must be impatient to use it here! But 
for the moment we’ll just consider the fair case, and calculate probabilities 
based on an extra assumption: that all outcomes should be equally likely. 

By equation (2.5), the four outcome probabilities should add to one, and
so

$$p((1,1)) = p((1,0)) = p((0,1)) = p((0,0)) = \frac{1}{4}.$$
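
The same model can be written down mechanically. This is a sketch under the equally-likely assumption made in the example; the variable names are chosen here for illustration. `itertools.product` generates the ordered pairs, and each pair receives mass 1/4.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses: ordered pairs with 1 = head, 0 = tail.
omega = list(product([1, 0], repeat=2))

# Equally likely outcomes (the extra assumption of the fair case).
p = {w: Fraction(1, len(omega)) for w in omega}

# One event as an illustration: at least one head in the two tosses.
at_least_one_head = [w for w in omega if 1 in w]
p_at_least_one_head = sum(p[w] for w in at_least_one_head)
```

With these definitions `omega` lists the four points in the order used in the text, and `p_at_least_one_head` is 3/4, the total mass of the three points that contain a 1.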


Exercise 2.3. In the two-toss experiment of Example 2.17, when the coin is
fair, use the four-point sample space $\Omega$ to calculate the probability that the
same result is obtained on both coin tosses.


Exercise 2.4. In the two-toss experiment of Example 2.17, when the coin is
fair, use the four-point sample space $\Omega$ to calculate the probability that the
first toss produces a head.
You know the answer already, but we are checking here that the sample 
space for two tosses is consistent with the sample space for one toss. 
Exercise 5.2 will show that the same result holds when we model tossing
a general coin, one which is not necessarily fair. 


Exercise 2.4 is an example of what we do when getting familiar with a
new tool. We check that it works properly! 


Exercise 2.5 (First toss of a million). In Exercise 2.4 you considered
finding the probability of success on one toss of a fair coin, when using the
model for two tosses. To no one’s surprise, the model for two tosses agrees
with the model for one toss.

How about using the model for tossing a fair coin a million times, as in 
Example 2.6? That has to work too, doesn’t it? But you will check that
now. 

You are only allowed to work with the big sample space. Any event you 
consider must be a subset of that space, which has $2^{10^6}$ points.

Just as in the case of tossing a fair coin twice (Example 2.17), we will
assume that all sample points have the same probability. And that probability
is $1/2^{10^6}$.

Let A be the event that the very first toss of the coin results in a head. 
Using the big sample space, find P(A). 

(And yes, we will rerun this problem in Exercise 7.8 for the case of a coin
which might be unfair. That works too.) 


Example 2.18 (Probabilities for two rolls of a fair die). Just as we 
can toss a coin twice, we can roll a die twice, or roll two different dice at 
the same time. The sample space is larger, but the principle is the same.
$\Omega = \{(i,j) : i = 1, \dots, 6,\ j = 1, \dots, 6\}$. There are 36 sample points.


Assume that the die is fair. We would like to know the probability distri- 
bution on this sample space. We are willing to assume that all sample points 
of $\Omega$ have the same probability p.


By equation (2.5), 36p = 1. Hence p((i, j)) = 1/36 for all i, j.


In problems involving experiments with two steps, it is often helpful to 
list the sample points in a table. Let the row indices refer to the first element 
in a pair and column indices refer to the second element in a pair. Then we 
list $\Omega$ as:


    |   1   |   2   |   3   |   4   |   5   |   6
  1 | (1,1) | (1,2) | (1,3) | (1,4) | (1,5) | (1,6)
  2 | (2,1) | (2,2) | (2,3) | (2,4) | (2,5) | (2,6)
  3 | (3,1) | (3,2) | (3,3) | (3,4) | (3,5) | (3,6)      (2.6)
  4 | (4,1) | (4,2) | (4,3) | (4,4) | (4,5) | (4,6)
  5 | (5,1) | (5,2) | (5,3) | (5,4) | (5,5) | (5,6)
  6 | (6,1) | (6,2) | (6,3) | (6,4) | (6,5) | (6,6)


We will revisit this experiment after introducing the concept of
independence (Chapter 5).


One is often interested in the sum of the scores on the two dice. Let $A_k$
be the event that the sum of the numbers obtained on the two rolls is equal
to k.


The largest possible sum is 12. So we see that $A_k$ is empty for k > 12.
The smallest possible sum is 2. So $A_1$ is empty.


To find $P(A_k)$, we need to count the number of outcomes $(x_1, x_2)$ in
$A_k$, for each k with $2 \le k \le 12$. Each outcome has probability 1/36, and we
add these probabilities, as usual.

$$
\begin{aligned}
A_2 &= \{(1,1)\}, & P(A_2) &= \tfrac{1}{36}, \\
A_3 &= \{(1,2), (2,1)\}, & P(A_3) &= \tfrac{2}{36}, \\
A_4 &= \{(1,3), (2,2), (3,1)\}, & P(A_4) &= \tfrac{3}{36}, \\
A_5 &= \{(1,4), (2,3), (3,2), (4,1)\}, & P(A_5) &= \tfrac{4}{36}, \\
A_6 &= \{(1,5), (2,4), (3,3), (4,2), (5,1)\}, & P(A_6) &= \tfrac{5}{36}, \\
A_7 &= \{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}, & P(A_7) &= \tfrac{6}{36}, \\
A_8 &= \{(2,6), (3,5), (4,4), (5,3), (6,2)\}, & P(A_8) &= \tfrac{5}{36}, \\
A_9 &= \{(3,6), (4,5), (5,4), (6,3)\}, & P(A_9) &= \tfrac{4}{36}, \\
A_{10} &= \{(4,6), (5,5), (6,4)\}, & P(A_{10}) &= \tfrac{3}{36}, \\
A_{11} &= \{(5,6), (6,5)\}, & P(A_{11}) &= \tfrac{2}{36}, \\
A_{12} &= \{(6,6)\}, & P(A_{12}) &= \tfrac{1}{36}.
\end{aligned}
\tag{2.7}
$$


It can also be helpful to list the values of the sum in the same tabular
form as the listing of the sample points given in equation (2.6). The row
and column indices continue to indicate the first and second members of the
sample point. Thus the sums corresponding to the sample points are:


    |  1 |  2 |  3 |  4 |  5 |  6
  1 |  2 |  3 |  4 |  5 |  6 |  7
  2 |  3 |  4 |  5 |  6 |  7 |  8
  3 |  4 |  5 |  6 |  7 |  8 |  9      (2.8)
  4 |  5 |  6 |  7 |  8 |  9 | 10
  5 |  6 |  7 |  8 |  9 | 10 | 11
  6 |  7 |  8 |  9 | 10 | 11 | 12
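
The counting in equation (2.7) can be checked by letting a computer enumerate the 36 sample points. The following is a sketch, with variable names chosen here for illustration, which tallies the sums shown in table (2.8) and divides each count by 36.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered pairs (i, j) for two rolls of a die.
omega = list(product(range(1, 7), repeat=2))

# counts[k] is the number of sample points whose sum is k, i.e. the size of A_k.
counts = Counter(i + j for i, j in omega)

# Each sample point has probability 1/36 (fair die assumption from the example).
p_sum = {k: Fraction(counts[k], len(omega)) for k in range(2, 13)}
```

The `Fraction` type reduces automatically, so for instance `p_sum[7]` is stored as 1/6 rather than 6/36; the values agree with equation (2.7), and they add up to one.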


Exercise 2.6. In the experiment of Example 2.18, let A be the event that
the first roll produces the number 5.

Find P(A), using the sample space $\Omega$ of Example 2.18.

You are checking that the two-roll model is consistent with the one-roll 
model. And yes, yes, again this is obvious physically. We are just testing for 
bugs in our mathematical machinery. 


Exercise 2.7. Consider the experiment of rolling a fair die twice. 

Find the probability that the first roll produces an even number and the 
second roll produces a number larger than four. 

As in Example 2.18, let $\Omega$ consist of the pairs $(x_1, x_2)$, where
$x_1 = 1, \dots, 6$ and $x_2 = 1, \dots, 6$.

We will return to this problem in Exercise 


Exercise 2.8. Again consider the experiment of rolling a fair die twice. Find 
the probability that the sum of the numbers obtained on the two rolls is less 
than or equal to 5. 


Exercise 2.9. When rolling a fair die twice, let C be the event that the sum 
of the numbers obtained on the two rolls is an even number. 

Find P(C). 

Let D be the event that the sum of the numbers obtained on the two
rolls is larger than 6. 

Find P(D) and $P(C \cap D)$.


Example 2.19 (Probability of drawing a card from a deck). This is 
the experiment defined in Example We said that drawing the top card 
from a deck is equivalent to selecting a member of a population of 52, with
no member of the population favored. Thus each card has the same probability
to be drawn, and we know these probabilities sum to one. Hence each card 
has the probability 1/52 to be drawn. 

Sometimes people think about dealing cards, which means removing cards 
repeatedly, starting from the top of the deck. The deck is shuffled before 
dealing, to arrange the cards of the deck in random order. Thus the third 
card dealt from the deck is a random sample from the deck, just as the top 
card is a random sample. And so the probability that any particular card 
will be the third card dealt is exactly the same as the probability that it is 
the first card dealt, 1/52 in both cases. 

We can picture the process more concretely if we think of randomly laying 
out all 52 cards face down on a table, forming a long row. Instead of drawing 
the top card from the deck, one might think of turning over the first card in 
the row. The third card drawn is the third card that we turn over, and so 
on. The probability that a particular card is in the first position is clearly 
the same as the probability that it is in the third position. 
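
The row-of-cards picture can be tested on a deck small enough to enumerate. In the sketch below we assume a hypothetical four-card deck with cards labelled A, B, C, D, and we assume that every ordering of the row is equally likely; `prob_card_at` is a helper name introduced only for this illustration.

```python
from fractions import Fraction
from itertools import permutations

deck = ["A", "B", "C", "D"]            # a tiny stand-in for the 52-card deck
orders = list(permutations(deck))      # every possible row of the four cards

def prob_card_at(card, position):
    """Probability that `card` sits at index `position`, all rows equally likely."""
    favorable = sum(1 for order in orders if order[position] == card)
    return Fraction(favorable, len(orders))

first = prob_card_at("C", 0)   # card C is the first card turned over
third = prob_card_at("C", 2)   # card C is the third card turned over
```

Both values equal 1/4, the analogue of 1/52 for the full deck: the position in the row makes no difference.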


Exercise 2.10. Let $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4, \omega_5\}$ be a sample space with
associated probability mass function p. Suppose that $p(\omega_2) = 2p(\omega_1)$,
$p(\omega_3) = 3p(\omega_2)$, $p(\omega_4) = 4p(\omega_3)$, $p(\omega_5) = 5p(\omega_4)$. Find $p(\omega_3)$.


Exercise 2.11. A certain combination lock will only open when the correct 
code is entered. The code consists of 4 digits in order. The allowable digits 
are 0,...,9. A stranger who does not know the correct code attempts to open 
the lock by entering 4 arbitrarily chosen digits. Find the probability that the 
lock opens. Express your reasoning in terms of an appropriate sample space 
and a probability mass function. If it seems appropriate with your model, 
you may assume that all sample points are equally probable. 


Exercise 2.12. An experiment consists of tossing a certain coin six times, and
counting the number of heads which are obtained. If we regard the outcome
of the experiment to be the number of heads which are obtained, then an
appropriate sample space for this experiment is $\Omega = \{0,1,2,3,4,5,6\}$. This
sample space is only adequate for listing the outcomes. It is definitely not 
an adequate sample space for computing probabilities of outcomes. 

Suppose that the coin which is used in this experiment is unfair, and 
actually the probability of a head on each toss is 1/3. We will show later 
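that the correct probability mass function p for $\Omega$ is given by

$$p(k) = \binom{6}{k} \left(\frac{1}{3}\right)^{k} \left(\frac{2}{3}\right)^{6-k}, \qquad k = 0, 1, \dots, 6,$$

where $\binom{6}{k}$ is the binomial coefficient given by

$$\binom{6}{k} = \frac{6!}{k!\,(6-k)!}.$$

(We will not use this particular sample space $\Omega$ when we derive this
formula. This sample space is too simple to represent what is going on in the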
experiment, which involves a number of steps.) 

As a small test of whether our formula for p is correct, use the binomial 
theorem (if you happen to know it) to verify that the values of p sum to one. 
If you haven’t met the binomial theorem before, omit this problem. And do 
not worry, the binomial theorem is derived in Section 


Exercise 2.13 (The number wheel experiment). At a booth in a fair- 
ground, we find a large wheel marked with the numbers from 0 to 100. By 
spinning the wheel, and seeing where it stops, a random number is chosen. 
This will be considered as the outcome of an experiment. 


(a) Provide a suitable sample space for this experiment. Assume that each 
outcome has equal probability, and find the probability mass function. 


(b) Answer the following questions. 


(i) What is the probability that the number is 3? 


(ii) What is the probability that the number is even?


(iii) Let A be the event that the number is smaller than 20, and let B be 
the event that the number is larger than 60. Find P(A U B). 


(iv) What is the probability that the number is less than 50 and is divisible 
by 3? Remember that zero is divisible by any number. 


Exercise 2.14 (Probability of a complement). The following obvious 
consequence of additivity is often surprisingly useful. Let A be an event for 
some experiment. Prove that 


$$P(A^c) = 1 - P(A). \tag{2.9}$$


Exercise 2.15. In Exercise 2.11, find the probability that the lock does not
open. 


Exercise 2.16. Using the probability model in Exercise 2.12, find the
probability that at least one head is obtained in the six tosses.


Exercise 2.17. Let A, B be any events. Show that $A$ and $B - A$ are disjoint,
and

$$B = (A \cap B) \cup (B - A), \tag{2.10}$$

and so by additivity,

$$P(B) = P(A \cap B) + P(B - A). \tag{2.11}$$


Figure 2.1: Exercise 2.17. $B = (A \cap B) \cup (B - A)$.


See Figure 2.1.


Example 2.20 (Choosing a positive integer). Suppose someone says to 
you: “Think of a number, any number.” Probably they mean that you should 
choose a positive integer, and they don’t want you to favor any particular 
number. Strictly speaking, this is impossible! To check that, consider the 
following argument. 

Let $p_k$ be the probability that you choose k. Assume that $p_k$ is the same
for all k. Let c be the value of $p_k$.

By additivity, the probability that you choose a number less than or equal
to n is exactly

$$p_1 + \dots + p_n.$$

Any probability is less than or equal to one, so you must have

$$p_1 + \dots + p_n \le 1.$$

If it is really true that $p_k = c$ for all k, then

$$nc \le 1, \quad \text{i.e.} \quad c \le \frac{1}{n}.$$


This can only be true for all n if c = 0. But then the probability of choosing 
a number less than or equal to n is zero, for every n. So the chance that you 
choose a number less than a million is zero, and the chance that you choose 
a number less than a trillion trillion is also zero, and so on. So it seems you 
will stand silently. No one will want to play this game with you! 

In real life, if someone asks you to think of a number, you will proba- 
bly not have a precise recipe in mind, but you likely have a finite range of 
possible numbers in mind, and try to choose one of them without being too 
predictable. 


Exercise 2.18 (Probability one is essentially everything). Suppose 
that P(A) = 1. For any event B, prove the following statements. 


(i) $P(B - A) = 0$.
(ii) $P(B \cap A) = P(B)$.


Here is one more fact that is often useful. 


Lemma 2.21 (Monotonicity). If $A_1$ and $A_2$ are events, then

$$A_1 \subset A_2 \implies P(A_1) \le P(A_2).$$

Here we use $\implies$ to mean “implies”.

In words, we can say that probability is monotone increasing as a function
of events, i.e. bigger sets give bigger probabilities. No surprise here, and
that’s good!


Proof. From the definitions, $A_2 = A_1 \cup (A_2 - A_1)$, and the sets $A_1$,
$A_2 - A_1$ are disjoint. (Figure 2.1 shows how to picture such relationships.)
Hence $P(A_2) = P(A_1) + P(A_2 - A_1) \ge P(A_1)$, since the probability
$P(A_2 - A_1)$ cannot be negative.


Notice the technique in this proof. We broke sets up into disjoint pieces, 
and then used additivity. This is a general trick. It is used, for example, in 
the proof of equation (2.13).


Definition 2.22 (Uniform distribution on a finite set). When a prob- 
ability model uses a finite sample space and assigns the same probability 
to every sample point, we will refer to this assignment of probabilities as a 
uniform distribution on the finite set $\Omega$.


Theorem 2.23 (Fair sampling). Let S be a set of n objects, and let T be
a subset of S containing j objects. Suppose an object is chosen randomly
from S, in such a way that all objects in S are treated the same way by the
selection process. Then the probability that the chosen object is a member
of T is j/n.


Proof. The simplest choice for a sample space is $\Omega = S$.

Then the set T is also the abstract representation of the event that the
selected object lies in T.

We want to show that P(T) = j/n.

We are told that there is symmetry in the selection process: all objects
are treated in exactly the same way. Hence $P(\{\omega\})$ is the same for all
$\omega \in \Omega$. Let’s call this number p.

Since $P(\Omega) = 1$, we know by additivity that

$$\sum_{\omega \in \Omega} p(\omega) = 1.$$

Hence np = 1, so p = 1/n.
Using additivity again,

$$P(T) = \sum_{\omega \in T} p(\omega) = jp = \frac{j}{n}.$$


It should be emphasized that Theorem 2.23 is not a surprising fact. If
there are n ways that something can happen, and if j of those ways are
“good”, and all ways seem equally likely, we would naturally think that the
likelihood of a good result depends on how big j is, compared to n. So j/n
is the value we expect for the probability of a good result.

With that in mind, the proof of Theorem 2.23 can be thought of as yet
another test of the theory of probability. The general theory gives the answer 
we expect. 
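
We can also compare the value j/n with observed frequencies, in the spirit of the frequency interpretation. The sketch below is a simulation under stated assumptions, not part of the theory: it uses Python's pseudo-random generator as a stand-in for a fair selection process, and the function name, the seed and the values n = 10, j = 3 are arbitrary choices made here.

```python
import random

def estimate_hit_frequency(n, j, trials, seed=0):
    """Fraction of trials in which a uniformly chosen object of S lands in T."""
    rng = random.Random(seed)      # fixed seed so the run is reproducible
    S = range(n)                   # the n objects, labelled 0, ..., n-1
    T = set(range(j))              # a subset containing j of the objects
    hits = sum(1 for _ in range(trials) if rng.choice(S) in T)
    return hits / trials

freq = estimate_hit_frequency(n=10, j=3, trials=100_000)
```

With this many trials the observed frequency `freq` should be close to j/n = 0.3, though a simulation can only suggest the value, not prove it.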


Exercise 2.19. Theorem 2.23 deals with the situation of Example 1.12. So
let’s review jelly bean selection. 

Consider picking a jelly bean randomly from a bowl. Suppose that there 
are 75 yellow beans, 53 red beans, 27 purple beans, and 18 green beans in 
the bowl. Find the probability that the selected jelly bean is red. 


Exercise 2.20. (a) A box contains 25 white marbles and 13 blue mar- 
bles. Our experiment consists of randomly selecting one marble. We assume 
that each marble has the same probability of being selected. What is the 
probability that the selected marble is blue? 


(b) Now we prepare a new experiment, which we will call experiment 2. 
We replace every blue marble in the box by 10 blue marbles, and we replace 
every white marble in the box by 10 white marbles. The actual procedure 
for experiment 2 is the same as before: randomly select one marble, in such 
a way that every marble has the same chance of being selected. What is the 
probability that the selected marble is blue? 


Example 2.24 (Choosing two beans). Return to the setting of
Exercise 2.19. A new experiment in this setting consists of randomly selecting
two jelly beans. If the chooser is planning to eat the jelly beans, it seems 
clear that the precise manner in which the beans are extracted from the bowl 
does not matter. The outcome here should be the set of two beans that is 
selected. 

Let A be the event that a red bean and a green bean are selected. We 
would like to find P(A). 

No jelly bean is favored in the choosing, so any set of two beans has the 
same chance of being selected. This is the crucial fact, since it allows us to 
use Theorem 2.23.

Using the chosen set as the sample point, Theorem 2.23 tells us that

$$P(A) = \frac{|A|}{|\Omega|},$$

where as usual we denote the number of elements in a set S by |S|.

However, we still need to calculate $|A|$ and $|\Omega|$. The necessary formula is
given later, in equation (8.2). But we don’t have to wait until we get to that
equation. The idea that is used in deriving equation (8.2) can already be used
right here: we will think about selecting the jelly beans one at a time. 

This may seem unnecessarily complicated, since when we eat the two jelly 
beans we don’t care which one was chosen first. However, it seems to be a 
good way to calculate the probability that a particular set of two jelly beans 
is chosen. 

Notice that by thinking about getting the jelly beans one at a time we 
have modified our experiment. Now it is a two-step experiment. We must 
define a new sample space $\Omega$. Now a sample point $\omega$ is not a set of two jelly
beans from the bowl, it is an ordered pair $(b_1, b_2)$, where $b_1$ represents the
first jelly bean chosen, and $b_2$ represents the second.


A key point: We definitely want to eat two jelly beans, so we only allow
sample points with $b_2 \ne b_1$. That is, after the first bean is selected, it is
removed from the bowl, and is no longer available for the second selection. 

No jelly bean is favored, so again all sample points are equally likely. 
Using Theorem 2.23 in this new sample space, we have

$$P(A) = \frac{|A|}{|\Omega|}.$$

What is the event A using this sample space? A is the set of all pairs
$(b_1, b_2)$, where either $b_1$ is red and $b_2$ is green, or else $b_1$ is green and
$b_2$ is red. There are 53 × 18 ways of choosing a red bean and then a green.
There are 18 × 53 ways of choosing a green bean and then a red. Thus
$|A| = 2 \times (53 \times 18)$.

We find $|\Omega|$ in much the same way. The total number of beans is 75 +
53 + 27 + 18 = 173. Hence there are 173 ways to choose the first jelly bean.
Having chosen the first bean, there are then 172 ways to choose the second
jelly bean.

Notice that the choice of first bean determines the possible choices for
the second bean, but the number of choices for the second bean is always the
same, and does not depend on what the first bean was. Thus $|\Omega| = 173 \times 172$.

Combining our facts,

$$P(A) = \frac{2 \times 53 \times 18}{173 \times 172}.$$
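If you would like to check this count by brute force, here is a minimal Python sketch. It builds the two-step sample space of ordered pairs of distinct beans and counts the pairs in A. The counts 53 (red) and 18 (green) come from the example; the labels `other_1` and `other_2` for the groups of 75 and 27 beans are placeholders chosen only for illustration, since their colors play no role in this calculation.

```python
from fractions import Fraction
from itertools import permutations

# Bean counts from the example. "other_1" and "other_2" are placeholder
# labels (an assumption); only "red" and "green" matter for the event A.
counts = {"other_1": 75, "red": 53, "other_2": 27, "green": 18}

# One list entry per physical bean, so beans of the same color stay distinct.
beans = [color for color, n in counts.items() for _ in range(n)]

# Sample space: ordered pairs (first pick, second pick) of distinct beans,
# represented by their positions in the list.
omega = list(permutations(range(len(beans)), 2))

# Event A: one red bean and one green bean, in either order.
event_a = [(i, j) for i, j in omega if {beans[i], beans[j]} == {"red", "green"}]

# All sample points are equally likely, so P(A) = |A| / |Omega|.
p_a = Fraction(len(event_a), len(omega))
print(len(event_a), len(omega), p_a, float(p_a))
```

Using `Fraction` keeps the answer exact, so it can be compared directly with the formula obtained by counting.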


Exercise 2.21 (Choosing two red beans). In the setting of Example 2.24,
let R be the event that both chosen beans are red. Find P(R).


2.5 Beyond additivity 


It is useful to say something about probabilities for unions which are not 
disjoint! 


Theorem 2.25 (Inclusion-Exclusion formula). Let A and B be any 
events, and let P be a probability set-function. Then 


$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{2.13}$$


The reason for the name of this formula will be evident from the proof. 




Chapter 2. Assumptions for probability, and their consequences 


Proof. If an outcome is in $A \cup B$, and it is not in both events, then either
the outcome is in A but not in B, or else the outcome is in B but not in A.
It follows that $A \cup B$ is the disjoint union of $A \cap B$, $A - B$, and $B - A$. By
additivity,

$$P(A \cup B) = P(A - B) + P(B - A) + P(A \cap B).$$

Similar arguments show even more easily that $P(A) = P(A - B) + P(A \cap B)$
and $P(B) = P(B - A) + P(A \cap B)$, and (2.13) follows.

Relating this proof to the name of the formula, note that $A \cap B$ is the set
of outcomes which are included in both A and B, while $A - B$ is the set of
outcomes we obtain from A when we exclude the outcomes in B.


In this proof we used the technique of breaking up non-disjoint sets into 
disjoint pieces. This is often a useful procedure. We have seen it already in 
the proof of Lemma 2.21.

In the statement of Theorem 2.25, consider the special case of finite sets
for which all the outcomes have the same probability. Remember that in
this situation we find the probability of any event by simply counting the
number of outcomes in the event, and then multiplying by the probability
of a single outcome. We can then say that the term $-P(A \cap B)$ in (2.13)
compensates for “double-counting” outcomes, since any outcome in $A \cap B$
contributes both to P(A) and to P(B).
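To see the double-counting concretely, here is a small Python sketch for one roll of a fair die. The events chosen (an even result, and a result of at least 4) are illustrative assumptions, not taken from the text. The sketch checks the Inclusion-Exclusion formula and also the disjoint decomposition used in the proof.

```python
from fractions import Fraction

omega = set(range(1, 7))                # one roll of a fair die
A = {w for w in omega if w % 2 == 0}    # even result: {2, 4, 6}
B = {w for w in omega if w >= 4}        # result at least 4: {4, 5, 6}

def prob(event):
    """Equally likely outcomes: count the outcomes, multiply by 1/|omega|."""
    return Fraction(len(event), len(omega))

# Inclusion-Exclusion: P(A or B) = P(A) + P(B) - P(A and B).
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)

# The proof splits the union into three disjoint pieces.
pieces = [A - B, B - A, A & B]
by_pieces = sum((prob(piece) for piece in pieces), Fraction(0))

print(lhs, rhs, by_pieces)
```

The outcomes 4 and 6 lie in both A and B, so they are counted twice in P(A) + P(B); subtracting P(A ∩ B) removes the extra count.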

Please work Exercise 2.22 after reading the next theorem.


Theorem 2.26 (Subadditivity). Let $A_1, \dots, A_k$ be any events for some
probability model. Then

$$P(A_1 \cup \dots \cup A_k) \le \sum_{j=1}^{k} P(A_j). \tag{2.14}$$


Proof. Consider the case k = 2. Let $A = A_1$, $B = A_2$. Since $P(A \cap B) \ge 0$,
equation (2.14) follows at once from (2.13).

This proves the theorem for k = 2.

The statement is obviously true for k = 1. (Right?)

A proof by induction for the case k > 2 is left to the reader, in Exercise 2.23.


Exercise 2.22 (Obtaining an estimate). In Exercise 2.1 on sample spaces,
we considered the experiment of tossing a fair coin 400 times.


(a) We accept that for each sample point $\omega$, $P(\{\omega\})$ is exactly the same.
What is this probability value?


(b) Consider the event A of ever obtaining 20 heads in succession during 
these 400 tosses. By definition, A occurs if there is any index k such that 
you get a head on toss k, toss k + 1, toss k+2,..., toss k + 19. 

Does this event feel likely or unlikely? 


(c) Use subadditivity to find an estimate for the probability of A, and 
decide whether A is likely or unlikely. 


Exercise 2.23 (The Old Induction Trick). Prove the case k > 2 of
Theorem 2.26. (This sort of argument, passing from k = 2 to general k, is
useful in many situations. If you haven’t seen it before it is worth working
through.)


2.6 A review of set operations 


Since we represent physical events by sets of abstract outcomes in a sample 
space, set operations will play a basic role. 

This section contains definitions and notations for all the standard set 
operations. Readers can quickly skim through it, and then refer back again 
as needed. Notations and terminology for sets can differ slightly, so even an 
experienced reader might benefit from a quick survey. You might want to 
recall something that J.R.R. Tolkien said about hobbits: “they liked to have 
books filled with things that they already knew, set out fair and square with 
no contradictions”. We are entering hobbit-mode now.

Here we go. The members of a set can be called “elements” of the set,
or “points” of the set, although of course such points need not have any 
geometrical meaning. Sometimes we may list the contents of a set as a 
sequence. The order in which the contents are listed is irrelevant, since sets 
are not ordered. The words “set” and “collection” have the same meaning 
throughout this book. Use of the word “collection” makes it possible to 
avoid too many repetitions of the word “set”. For example if we happen to 
be dealing with a set of sets, we would tend to use the phrase “collection of 
sets” rather than “set of sets”. 


Definition 2.27 (Unions of sets). Let $A_1, \dots, A_k$ be any sets. The union
of $A_1, \dots, A_k$ is the set consisting of every element which is in at least one of
the sets $A_1, \dots, A_k$. We can write the union of two sets $A_1$, $A_2$ in symbols as
$A_1 \cup A_2$, and the union of $A_1, \dots, A_k$ as $A_1 \cup \dots \cup A_k$.


It is easy to check from the definition that $A \cup B = B \cup A$, or in other
words that union is a commutative operation. It is also easy to check that
$A \cup (B \cup C) = A \cup B \cup C = (A \cup B) \cup C$, so that union is an associative
operation.


Definition 2.28 (Intersections of sets). Let $A_1, \dots, A_k$ be any sets. The
intersection of $A_1, \dots, A_k$ is the set consisting of those elements which are in
every one of $A_1, \dots, A_k$. We can write the intersection of two sets $A_1$, $A_2$ in
symbols as $A_1 \cap A_2$, and the intersection of $A_1, \dots, A_k$ as $A_1 \cap \dots \cap A_k$.


Like union, intersection is a commutative and associative operation, as 
can easily be checked. 
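If you like to experiment, Python's built-in `set` type follows these definitions directly: `|` is union and `&` is intersection. The short sketch below uses three small example sets (chosen arbitrarily for illustration) and checks commutativity and associativity for them. Checking one example is of course not a proof.

```python
A = {1, 2, 3}
B = {3, 4}
C = {4, 5}

# Union and intersection of two sets.
print(A | B)   # elements in at least one of A, B
print(A & B)   # elements in both A and B

# Commutativity: the order of the two sets does not matter.
commutative = (A | B == B | A) and (A & B == B & A)

# Associativity: the grouping does not matter, so brackets can be dropped.
associative = (A | (B | C) == (A | B) | C) and (A & (B & C) == (A & B) & C)

# Union and intersection of a whole list of sets A_1, ..., A_k.
sets = [A, B, C]
union_all = set().union(*sets)
inter_all = set.intersection(*sets)
print(commutative, associative, union_all, inter_all)
```

Here no element lies in all three sets, so the intersection of A, B, C comes out as the empty set.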

Usually a set that we deal with is defined by some property, i.e. a sample 
space event is the set of all sample points which have a certain property. 
For sets we have the option of using property language as an alternative 
to set language. Union corresponds to “or” and intersection corresponds to
“and”. That is, if set A is the collection of objects that satisfy property $\alpha$,
and set B is the collection of objects that satisfy property $\beta$, then $A \cup B$ is
the collection of all objects for which “$\alpha$ or $\beta$” is true, and $A \cap B$ is the
collection of all objects for which “$\alpha$ and $\beta$” is true. Writing “$A \cup B$” seems
a little shorter than writing “$\alpha$ or $\beta$”, but is not necessarily clearer.


Remark 2.29 (The inclusive sense of the word “or”). It should be
emphasized that when we say that $A \cup B$ is the collection of all objects for
which “$\alpha$ or $\beta$” is true, we are using the word “or” in the inclusive sense,
which includes the possibility that both statements might be true.

The inclusive sense is one of the two correct uses of the word “or” in 
English. For example, if I say, “I dream of being rich or famous”, this does 
not mean that I would be heartbroken if I were both, so I am using the 
inclusive sense. 

On the other hand, suppose you are ordering supper at your favorite 
diner, and your order comes with a free dessert. When the waiter says: “You 
can have jello or rice pudding”, this is very likely an example of using “or” 
in the exclusive sense, meaning that exactly one of two possibilities is true. 

In mathematics, if we mean “or” in the exclusive sense, we will say so 
explicitly, unless it is obvious. 


Definition 2.30 (Set difference and complement). For any sets A and
B, $A - B$ denotes the set difference, simply meaning the set of elements
which are in A but not in B. The set difference $A - B$ is sometimes written
as $A \setminus B$, but we won’t use that notation.

If you think that your reader knows the “universe” U of elements that
you are currently interested in, then for any set A contained in U, the set
$U - A$ can be written more briefly as $A^c$. The set $A^c$ is called the complement
of A. In probability theory the set U is often the sample space $\Omega$.


Just as union corresponds to “or” in property language, and intersection 
corresponds to “and” in property language, set difference and complement 
correspond to “not” in property language. 

If a subset A of the sample space represents the occurrence of a certain
physical event E, then $A^c$ represents the event that E does not occur.
Complements are sometimes more convenient than set differences.


Exercise 2.24 (De Morgan’s Laws). Please verify the following facts: 


$$(A^c)^c = A, \tag{2.15}$$
$$(A \cup B)^c = A^c \cap B^c, \tag{2.16}$$
$$(A \cap B)^c = A^c \cup B^c. \tag{2.17}$$

In property language, equation (2.15) expresses the meaning of a “double
negative”. Equations (2.16) and (2.17) are known as De Morgan’s laws.
Using equation (2.15) one can deduce either of De Morgan’s laws from the
other.
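A computer check is not a substitute for the verification asked for in the exercise, but it can be reassuring. The following Python sketch tests equations (2.15), (2.16) and (2.17) for every pair of subsets of a small universe. The universe {0, 1, 2, 3} and the helper names `subsets` and `complement` are assumptions made for this illustration.

```python
from itertools import combinations

U = frozenset(range(4))   # a small "universe" of four elements

def subsets(s):
    """Return all subsets of s as frozensets."""
    items = list(s)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def complement(a):
    """Complement relative to the universe U."""
    return U - a

all_ok = all(
    complement(complement(a)) == a                           # (2.15)
    and complement(a | b) == complement(a) & complement(b)   # (2.16)
    and complement(a & b) == complement(a) | complement(b)   # (2.17)
    for a in subsets(U)
    for b in subsets(U)
)
print(len(subsets(U)), all_ok)
```

A universe with four elements has 16 subsets, so the loop tests 256 pairs (A, B).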


Definition 2.31 (Venn diagrams). Visual images seem to aid our thinking 
at times, and books often represent sets and their relationships with pictures, 
called Venn diagrams. Venn diagrams for sets are not pictures of actual sets, 
but they are schematic representations which show certain properties. 

Most readers will have seen such diagrams, often outside mathematics. 
Figure 2.1 is a good example.


In general, readers are encouraged to follow any urge to draw pictures 
when thinking about any problems or concepts! 


Definition 2.32 (Set membership). We can express membership in a set
by “$\in$”. Thus $x \in A$ means that x is a member of A.


Using $\in$ takes less space than using the word “in”, so we’ll tend to use $\in$
in formulas later.


Definition 2.33 (Set comparison). We write $A \subset B$ to mean that every
member of A is also a member of B. In this case we say that A is a subset
of B, or that A is included in B.

If A is a subset of B, but A is not equal to B, we say in words that A is a
proper subset of B. We do not have a separate notation for proper inclusion.
(The inclusion relation is sometimes written $A \subseteq B$, in which case $A \subset B$
might denote proper inclusion, but we won’t use that convention.)


The word “contains” is used in two ways for sets. If $x \in A$ we say that
A contains x, but occasionally if $A \subset B$ one also says B contains A. The
context usually makes the meaning clear.


Definition 2.34 (Disjoint sets and the empty set). Any sets A, B are
disjoint if there is no element which is in both sets. Thus A and B are
disjoint if $A \cap B = \emptyset$, where $\emptyset$ denotes the empty set.

Sets $A_1, \dots, A_k$ are disjoint if there is no point which is a member of more
than one of the sets $A_1, \dots, A_k$.


In property language, disjoint sets correspond to mutually exclusive
properties. If $A_1, \dots, A_k$ are disjoint events in a sample space, then a sample
point can be a member of at most one of $A_1, \dots, A_k$. That is, at most one
of the corresponding physical events can occur.


Exercise 2.25. When manipulating sets we often use simple observations 
such as 


$$\begin{aligned} A \subset B &\implies A \cap B = A, \\ A \cap (B - A) &= \emptyset. \end{aligned} \tag{2.18}$$

Here we use $\implies$ to mean “implies”.
Please prove the facts in equation (2.18).


Number of elements in a set. A set can be finite or infinite. The number
of elements in a finite set S will be denoted by $|S|$. If S is an infinite set we
will write $|S| = \infty$.


Exercise 2.26 (Intersection distributes over union). Prove that 


$$B \cap (A_1 \cup \dots \cup A_k) = (B \cap A_1) \cup \dots \cup (B \cap A_k). \tag{2.19}$$


Equation (2.19) can be expressed by saying that “and distributes over 
or”. 
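Before proving equation (2.19), you may want to try it out on an example. Here is a minimal Python sketch with k = 3; the particular sets are arbitrary choices made for illustration.

```python
B = {1, 2, 3, 4}
As = [{1, 5}, {2, 6}, {7, 8}]   # plays the role of A_1, A_2, A_3

# Left side of (2.19): intersect B with the union of the A_j.
lhs = B & set().union(*As)

# Right side of (2.19): intersect B with each A_j, then take the union.
rhs = set().union(*(B & a for a in As))

print(lhs, rhs, lhs == rhs)
```

Notice that the third set contributes nothing to either side, since it has no points in common with B.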

You are asked to show in the next exercise that “or distributes over and”.
This second fact is not hard either, but it is worth checking, especially since
we only have one distributive law in the case of numbers!




Exercise 2.27 (Union distributes over intersection). Write an equation 
in terms of set operations expressing the fact that union distributes over 
intersection. Then prove this equation. Give two proofs. The first proof 
should only depend on the basic definitions of union and intersection. The 
second proof should use equation (2.19) and De Morgan’s Laws.


In practical situations, using “and” and “or”, we easily recognize the
truth of the rules in Exercises 2.26 and 2.27, even though we may not think
of them abstractly.

George Boole seems to have been the first person to observe (in 1854) that 
such general algebraic properties can be formulated for logical statements 
involving “and”, “or”, and “not”. 


2.7 Solutions for Chapter 2


Solution (Exercise 2.1). (a) We can take the sample space to be the
set of all sequences $(x_1, \dots, x_{400})$, where each $x_j$ can be either H or T. Since
there are two choices for each $x_j$, the sample space contains $2^{400}$ points.


(b) Writing $2^{400}$ as $(2^4)^{100} = 16^{100}$, we see that the number of points in the
sample space is much larger than $10^{100}$, and hence it is much larger than the
number of atoms in the observable universe.

Of course the number of abstract events for the sample space is $2^N$, which
is a far bigger number than N.
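Python integers have unlimited size, so the comparison in part (b) can be checked exactly. This is only a sketch for the curious; the variable name `n_points` is our own choice for illustration.

```python
# Number of sample points for 400 coin tosses.
n_points = 2 ** 400

# The rewriting used in part (b): 2^400 = (2^4)^100 = 16^100.
same_number = (n_points == 16 ** 100)

# Comparison with 10^100, and the number of decimal digits of 2^400.
bigger = n_points > 10 ** 100
n_digits = len(str(n_points))

print(same_number, bigger, n_digits)
```

The digit count shows just how far beyond $10^{100}$ the number of sample points is.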


Solution (Exercise 2.2). Since $A = \{2, 4, 6\}$,
$$A = \{2\} \cup \{4\} \cup \{6\},$$
so by the additivity of probability we have
$$P(A) = P(\{2\}) + P(\{4\}) + P(\{6\}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}.$$

Solution (Exercise 2.3). Let A be the event that the same result is obtained
on both coin tosses. Then

$$A = \{(1,1), (0,0)\} = \{(1,1)\} \cup \{(0,0)\}.$$

By the additivity of probability,

$$P(A) = P(\{(1,1)\}) + P(\{(0,0)\}) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}.$$

Solution (Exercise 2.4). Let $\Omega = \{(1,1), (1,0), (0,1), (0,0)\}$.
Remember that the sample point (1,0) represents the result that the first
toss gives success (a head) and the second toss does not, and so on.
Let $H_1$ denote the event that the first toss produces a head. Then
$H_1 = \{(1,1), (1,0)\}$, so

$$P(H_1) = P(\{(1,1)\}) + P(\{(1,0)\}) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2},$$

in agreement with the probability found using the sample space for one coin
toss.


Solution (Exercise 2.5). To save writing, let N = 1000000.
Let $(x_1, \dots, x_N)$ be a sample point in A.
Then $x_1 = 1$, and there are two choices for each of the remaining $x_j$, for
$j = 2, \dots, N$. Thus $|A| = 2^{N-1}$.
Since the coin is fair, each of the $2^N$ sample points is equally likely. Thus
$P(\{\omega\}) = 2^{-N}$ for each sample point $\omega$, and so

$$P(A) = \frac{|A|}{2^N} = \frac{2^{N-1}}{2^N} = \frac{1}{2},$$

as we knew.

Solution (Exercise 2.6). Using the sample space of the earlier example,
$A = \{(5,1), (5,2), (5,3), (5,4), (5,5), (5,6)\}$.
Hence
$$P(A) = P(\{(5,1)\}) + P(\{(5,2)\}) + P(\{(5,3)\}) + P(\{(5,4)\}) + P(\{(5,5)\}) + P(\{(5,6)\}).$$
Hence P(A) = 6/36 = 1/6, consistent with the model for rolling a single die.


Solution (Exercise 2.7). Let A be the event that the first roll gives an even
number. Let B be the event that the second roll gives a number larger than
four. Our sample space consists of all pairs $(x_1, x_2)$, where each $x_j$ can be
1, 2, 3, 4, 5 or 6.

Each outcome $(x_1, x_2)$ has probability 1/36.

To obtain an outcome in $A \cap B$, it is easy to see that there are 3 ways to
choose $x_1$ and 2 ways to choose $x_2$. Thus there are $3 \times 2 = 6$ outcomes in
$A \cap B$, so $P(A \cap B) = 6(1/36) = 1/6$.


Solution (Exercise 2.8). Using the sample space of the earlier example, let A
be the event that the sum of the numbers on the two dice is at most equal
to 5. Consider an outcome $(x_1, x_2)$ in A.

Since $x_2$ is greater than zero, $x_1$ cannot be larger than 4. When $x_1 = 1$,
$x_2$ can be any of 1, 2, 3, 4. When $x_1 = 2$, $x_2$ can be any of 1, 2, 3. When
$x_1 = 3$, $x_2$ can be 1 or 2. When $x_1 = 4$, $x_2$ must be 1. Thus the number of
outcomes in A is equal to 4 + 3 + 2 + 1 = 10. Hence P(A) = 10(1/36) = 5/18.
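Counts like the ones in the last three solutions are easy to confirm by listing all 36 outcomes. The Python sketch below does this; the helper `prob` and the variable names are our own, introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two rolls of a die: all pairs (x1, x2).
omega = list(product(range(1, 7), repeat=2))

def prob(condition):
    """Probability of the event {(x1, x2) : condition(x1, x2)},
    with every outcome having probability 1/36."""
    favorable = sum(1 for x1, x2 in omega if condition(x1, x2))
    return Fraction(favorable, len(omega))

p_first_is_5 = prob(lambda x1, x2: x1 == 5)                  # Exercise 2.6
p_even_then_big = prob(lambda x1, x2: x1 % 2 == 0 and x2 > 4)  # Exercise 2.7
p_sum_at_most_5 = prob(lambda x1, x2: x1 + x2 <= 5)          # Exercise 2.8

print(p_first_is_5, p_even_then_big, p_sum_at_most_5)
```

Listing outcomes works well here because the sample space is tiny; the counting arguments in the solutions are what carry over to larger problems.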


Solution (Exercise 2.9). The sets $A_2, \dots, A_{12}$ are disjoint.

We will use equation (2.7).
Since $C = A_2 \cup A_4 \cup A_6 \cup A_8 \cup A_{10} \cup A_{12}$,

$$\begin{aligned} P(C) &= P(A_2) + P(A_4) + P(A_6) + P(A_8) + P(A_{10}) + P(A_{12}) \\ &= \frac{1}{36} + \frac{3}{36} + \frac{5}{36} + \frac{5}{36} + \frac{3}{36} + \frac{1}{36} = \frac{18}{36} = \frac{1}{2}. \end{aligned} \tag{2.20}$$

Since $C \cap D = A_8 \cup A_{10} \cup A_{12}$,

$$P(C \cap D) = P(A_8) + P(A_{10}) + P(A_{12}) = \frac{5}{36} + \frac{3}{36} + \frac{1}{36} = \frac{1}{4}.$$

Looking ahead to the topic of independence (Chapter 5), note that
$P(C \cap D) \neq P(C)P(D)$, so the events C, D are not independent.


Solution (Exercise 2.10). Since $P(\Omega) = 1$ we have p(1) + p(2) + p(3) +
p(4) + p(5) = 1.
We find easily that $p(\omega_n) = n!\,p(\omega_1)$, for n = 1, 2, 3, 4, 5.
Hence $(1 + 2 + 6 + 24 + 120)\,p(\omega_1) = 1$, and so $p(\omega_1) = 1/153$. This
gives $p(\omega_3) = 3!/153 = 6/153$.


Solution (Exercise 2.11). The sample space $\Omega$ can be taken to be the set
of all sequences $(k_1, k_2, k_3, k_4)$, where each $k_j$ is in {0, ..., 9}. Since there are
10 choices for each $k_j$, the number of sample points is $10^4$. We have no reason
to think that the owner of the lock prefers a particular code, so we consider
that each sample point $\omega$ has the same probability $p(\omega)$ of being the correct
code. These probabilities add to one, so $p(\omega) = 1/10000$ for all $\omega$. Hence for
any given sample point, such as the sequence which the stranger enters, the
probability that this particular sequence is the correct code is 1/10000.


Solution (Exercise 2.12). By the binomial theorem,

$$(a + b)^6 = \sum_{k=0}^{6} \binom{6}{k} a^k b^{6-k}.$$

If we take a = 1/3 and b = 2/3, the right side of this equation is p(0) + ... +
p(6). The left side is clearly equal to one.


Solution (Exercise 2.13). (a) $\Omega = \{0, 1, \dots, 100\}$. Since the probability
values are equal and sum to 1, $p(\omega) = 1/101$ for every $\omega$.


(b)
(i) P({3}) = p(3) = 1/101.


(ii) There are 51 even numbers in the sequence 0, 1, ..., 100. Summing up
$p(\omega)$ for these $\omega$, the probability is 51/101.


(iii) Since A contains 20 numbers, P(A) = 20/101. Since B contains 40
numbers, P(B) = 40/101. A and B are disjoint, so
$P(A \cup B) = P(A) + P(B) = 60/101$.


(iv) Numbers divisible by 3 are of the form $3 \times k$. Numbers of this form
which are less than 50 are the numbers $3 \times 0, 3 \times 1, 3 \times 2, \dots, 3 \times 16$.
Hence there are 17 numbers in the event described, and the event has
probability 17/101.
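The two counts in parts (ii) and (iv) are easy to get wrong by one, because the sequence starts at 0. A quick check in Python, as a sketch only (the variable names are our own):

```python
from fractions import Fraction

omega = range(0, 101)            # the sample points 0, 1, ..., 100
p = Fraction(1, len(omega))      # each point has probability 1/101

# Part (ii): even numbers in the sample space (0 counts as even).
evens = [w for w in omega if w % 2 == 0]

# Part (iv): numbers divisible by 3 which are less than 50 (0 is included).
threes = [w for w in omega if w % 3 == 0 and w < 50]

print(len(evens), len(evens) * p)
print(len(threes), len(threes) * p)
```

In both cases the count is one more than a hasty guess, because 0 belongs to the event.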


Solution (Exercise 2.14). $\Omega = A \cup A^c$, and this is a disjoint union. By
additivity, $P(A) + P(A^c) = P(\Omega) = 1$.


Solution (Exercise 2.15). From the solution to Exercise 2.11, the probability
that the lock opens is 1/10000. Hence the probability that the lock
does not open is 1 - 1/10000 = 9999/10000.


Solution (Exercise |2.16)). By the statement of Exercise |2.12| the proba- 
bility that no head is obtained is Gr Hence the probability that at least one 
head is obtained is 1 — (cy: 


Solution (Exercise 2.17). By definition, if $\omega \in B - A$ then $\omega \notin A$, so A
and $B - A$ are disjoint.

If $\omega \in A \cap B$ then $\omega \in B$, by definition. If $\omega \in B - A$ then $\omega \in B$, by
definition. Hence if $\omega \in (A \cap B) \cup (B - A)$ then necessarily $\omega \in B$.

On the other hand, if $\omega \in B$ then either $\omega \in A$ or $\omega \notin A$. In the first
case $\omega \in A \cap B$, and in the second case $\omega \in B - A$. Thus in all cases
$\omega \in (A \cap B) \cup (B - A)$.

We have shown that B and $(A \cap B) \cup (B - A)$ are the same set. This
proves equation (2.10).

Since $A \cap B \subset A$, and since we know that A and $B - A$ are disjoint, we
know that $A \cap B$ and $B - A$ are disjoint. Hence by additivity we obtain
equation (2.11).


Solution (Exercise 2.18). (i) For any events A, B, $B - A \subset A^c$, so
monotonicity tells us that $P(B - A) \le P(A^c)$. When P(A) = 1,
$P(A^c) = 1 - P(A) = 0$, so $P(B - A) = 0$.

(ii) For any events A, B, $B = (B \cap A) \cup (B - A)$, and this is a disjoint
union, so $P(B) = P(B \cap A) + P(B - A)$.
If P(A) = 1 then $P(B - A) = 0$ by part (i), and hence $P(B) = P(B \cap A)$.


Solution (Exercise 2.19). There are 173 beans in the bowl. By Theorem 2.23,
the probability of picking a red bean is 53/173.


Solution (Exercise 2.20). (a) There are 38 sample points, all of equal
probability 1/38. Let A be the event. A contains 13 sample points, so
P(A) = 13/38, by Theorem 2.23.

(b) There are 380 sample points, all of equal probability 1/380. Let B be
the event. B contains 130 sample points, so P(B) = 130/380 = 13/38, by
Theorem 2.23.


Solution (Exercise 2.21). Imagine choosing one bean at a time.
As in Example 2.24, $\Omega$ is the set of all pairs $(b_1, b_2)$, where $b_1$ represents
the first bean selected and $b_2$ represents the second bean selected, with $b_2 \neq b_1$.
There are 173 beans. There are 173 ways to choose the first bean, and,
having selected the first bean, there are 172 ways to choose the second bean.
Thus
$$|\Omega| = 173 \times 172,$$
and
$$P(\{\omega\}) = \frac{1}{|\Omega|} = \frac{1}{173 \times 172}$$

for every $\omega$.

When choosing two red beans, there are 53 ways to choose the first bean
and, having chosen the first bean, there are 52 ways to choose the second
bean. Thus

$$|R| = 53 \times 52,$$
and
$$P(R) = \frac{53 \times 52}{173 \times 172}.$$
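Here is a short Python sketch that evaluates this answer exactly. As a cross-check it also counts unordered sets of two beans, using the binomial coefficient `math.comb`. That second count is an assumption on our part about how sets of size two are counted (the text defers the relevant formula to a later chapter); it is included only to show that the ordered and unordered viewpoints agree.

```python
from fractions import Fraction
from math import comb

total_beans = 173
red_beans = 53

# Ordered pairs, as in the solution: (first pick, second pick).
p_ordered = Fraction(red_beans * (red_beans - 1),
                     total_beans * (total_beans - 1))

# Unordered sets of two beans (assumed count: "n choose 2").
p_unordered = Fraction(comb(red_beans, 2), comb(total_beans, 2))

print(p_ordered, p_unordered, float(p_ordered))
```

Each unordered set of two beans corresponds to exactly two ordered pairs, so the factor of 2 cancels between numerator and denominator.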


Solution (Exercise 2.22). (a) There are $2^{400}$ sample points $\omega$, each one
of the same probability p. Hence $2^{400} p = 1$, so $p = 2^{-400}$.


(c) Let $A_j$ be the event that a head is obtained on each of the tosses
j, j + 1, ..., j + 19. This event is defined for j = 1, ..., 381.

Since the result of the other tosses is not specified, each set $A_j$ contains
$2^{380}$ sample points, and each sample point has probability $2^{-400}$. Thus

$$P(A_j) = 2^{380} \cdot 2^{-400} = 2^{-20}.$$

Notice that we get the same probability for $A_j$ if we think of tosses
j, j + 1, ..., j + 19 as a small experiment by itself.
By the definition of A, $A = A_1 \cup \dots \cup A_{381}$. By subadditivity,

$$P(A) \le P(A_1) + \dots + P(A_{381}) = \frac{381}{2^{20}} \approx 0.00036335.$$

This is a small value.
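The arithmetic of this estimate can be reproduced in a few lines of Python. This sketch only evaluates the subadditivity bound; it does not compute the exact value of P(A), which is smaller. The variable names are our own.

```python
from fractions import Fraction

n_tosses = 400
run_length = 20

# A run of 20 heads can start at toss 1, 2, ..., 381.
n_starts = n_tosses - run_length + 1

# Probability of heads on 20 specified tosses of a fair coin.
p_single = Fraction(1, 2 ** run_length)

# Subadditivity: P(A) is at most the sum of the 381 equal probabilities.
bound = n_starts * p_single

print(n_starts, bound, float(bound))
```

Keeping the bound as a `Fraction` avoids any rounding until the final conversion to a decimal.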
Solution (Exercise 2.23). Assume it is known that for any events $A_1$, $A_2$,

$$P(A_1 \cup A_2) \le P(A_1) + P(A_2). \tag{2.21}$$

We will prove by induction that equation (2.14) holds for all $k \ge 2$.
For some integer $k \ge 2$, suppose it is known that for any events $A_1, \dots, A_k$,

$$P(A_1 \cup \dots \cup A_k) \le \sum_{j=1}^{k} P(A_j). \tag{2.22}$$

Let $A_1, \dots, A_{k+1}$ be given. Define
$$A = A_1 \cup \dots \cup A_k.$$
By the meaning of union,
$$A_1 \cup \dots \cup A_k \cup A_{k+1} = A \cup A_{k+1}.$$
By equation (2.21),
$$P(A_1 \cup \dots \cup A_{k+1}) \le P(A) + P(A_{k+1}).$$

Since we have assumed the truth of equation (2.22), we know that

$$P(A) \le \sum_{j=1}^{k} P(A_j).$$

Combining the last two equations,
$$P(A_1 \cup \dots \cup A_{k+1}) \le \sum_{j=1}^{k} P(A_j) + P(A_{k+1}) = \sum_{j=1}^{k+1} P(A_j).$$
Thus equation (2.14) holds with k replaced by k + 1.


By induction, equation (2.14) holds for all $k \ge 2$.


Solution (Exercise 2.24). To verify $(A^c)^c = A$, note that by definition $A^c$
is the set of elements which are in the universe but not in A.

By definition, $(A^c)^c$ is the set of elements which are in the universe but
not in $A^c$. Thus $(A^c)^c$ is the set of elements x in the universe such that the
statement “x is not in A” is false, i.e. the statement “x is in A” is true. This
shows that equation (2.15) holds.


To verify that $(A \cup B)^c = A^c \cap B^c$, note that by definition $(A \cup B)^c$ is the
set of elements x in the universe for which the statement “x is in $A \cup B$” is
false. That is, $(A \cup B)^c$ is the set of elements x in the universe for which the
statement “x is in A or x is in B” is false.

Equivalently, $(A \cup B)^c$ is the set of elements x in the universe such that
both of the two statements “x is in A” and “x is in B” are false. Thus $(A \cup B)^c$
is the set of elements x in the universe such that $x \in A^c$ and $x \in B^c$. This
shows that equation (2.16) holds.


To verify that $(A \cap B)^c = A^c \cup B^c$, note that by definition $(A \cap B)^c$ is the
set of elements x in the universe for which the statement “x is in $A \cap B$” is
false. That is, $(A \cap B)^c$ is the set of elements x in the universe for which the
statement “x is in A and x is in B” is false.

Equivalently, $(A \cap B)^c$ is the set of elements x in the universe which are
not in $A \cap B$.

Thus $(A \cap B)^c$ is the set of elements x in the universe such that at least
one of the statements “$x \in A^c$”, “$x \in B^c$” is true.

Thus $(A \cap B)^c$ is the set of elements x in the universe such that
$x \in A^c \cup B^c$. This shows that equation (2.17) holds.


One way to deduce equation (2.17) from (2.16) using (2.15) is to start
by stating (2.16) with A replaced by $A^c$ and B replaced by $B^c$. This gives:
$$(A^c \cup B^c)^c = (A^c)^c \cap (B^c)^c.$$

By equation (2.15),
$$(A^c \cup B^c)^c = A \cap B.$$

Hence
$$(A \cap B)^c = \left((A^c \cup B^c)^c\right)^c.$$

By equation (2.15),

$$(A \cap B)^c = A^c \cup B^c.$$


One can make the last argument a bit more readable by noticing a simple
consequence of equation (2.15):


“Two events are equal if and only if their complements are equal.”


Solution (Exercise 2.25). Yep, these follow from the definitions.

To prove that $A \subset B \implies A \cap B = A$, assume that $A \subset B$.

Then if $x \in A$ then $x \in B$. Hence $x \in A \cap B$.

On the other hand, if $x \in A \cap B$ then by definition $x \in A$ and $x \in B$, so
in particular $x \in A$.

We have shown that $A = A \cap B$.

This proves the first statement.

To prove that $A \cap (B - A) = \emptyset$, consider $x \in A$. If $x \in B - A$ then by
definition x is not a member of A, which is false. Thus $x \notin B - A$. Since A
and $B - A$ have no members in common, $A \cap (B - A) = \emptyset$.


Solution (Exercise 2.26). Let x be a point in $B \cap (A_1 \cup \dots \cup A_k)$. By
definition, $x \in B$ and there is some index j such that $x \in A_j$. Then
$x \in B \cap A_j$. Hence by definition $x \in (B \cap A_1) \cup \dots \cup (B \cap A_k)$.


Let y be a point in $(B \cap A_1) \cup \dots \cup (B \cap A_k)$. By definition, there is some
index j such that $y \in B \cap A_j$. Then $y \in B$ and $y \in A_j$. Hence $y \in B$ and
$y \in A_1 \cup \dots \cup A_k$, so by definition $y \in B \cap (A_1 \cup \dots \cup A_k)$.

We have proved that $B \cap (A_1 \cup \dots \cup A_k)$ and $(B \cap A_1) \cup \dots \cup (B \cap A_k)$
contain exactly the same points, so they are the same set.


Solution (Exercise 2.27). We must show that:
$$B \cup (A_1 \cap \dots \cap A_k) = (B \cup A_1) \cap \dots \cap (B \cup A_k). \tag{2.23}$$
First proof: Let x be a point in $B \cup (A_1 \cap \dots \cap A_k)$. By definition, this
means that at least one of the following statements holds:
(i) $x \in B$.
(ii) $x \in A_1 \cap \dots \cap A_k$.


If statement (i) is true then $x \in B \cup A_j$ for every j, and so by definition
$x \in (B \cup A_1) \cap \dots \cap (B \cup A_k)$.

If statement (ii) is true, then $x \in A_j$ for every j. Hence again we have
$x \in B \cup A_j$ for every j, so again $x \in (B \cup A_1) \cap \dots \cap (B \cup A_k)$.

Thus in all cases, $x \in (B \cup A_1) \cap \dots \cap (B \cup A_k)$.


Let y be a point in $(B \cup A_1) \cap \dots \cap (B \cup A_k)$.
By definition, for every index j = 1, ..., k, $y \in B \cup A_j$.

If $y \in B$, then statement (i) holds with x replaced by y.

If $y \notin B$, then for every index j, $y \in A_j$ must hold, since otherwise
$y \in B \cup A_j$ would be false. Hence in this case $y \in A_1 \cap \dots \cap A_k$, so statement (ii)
holds with x replaced by y.

We have shown that either statement (i) or statement (ii) holds, so
$y \in B \cup (A_1 \cap \dots \cap A_k)$.

We have proved that $B \cup (A_1 \cap \dots \cap A_k)$ and $(B \cup A_1) \cap \dots \cap (B \cup A_k)$
contain exactly the same points, so they are the same set.

This proves equation (2.23).


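Second proof: Let C be any set, and let $D_1, \dots, D_k$ be any sets.
By equation (2.19),
$$C \cap (D_1 \cup \dots \cup D_k) = (C \cap D_1) \cup \dots \cup (C \cap D_k).$$
Then
$$\left(C \cap (D_1 \cup \dots \cup D_k)\right)^c = \left((C \cap D_1) \cup \dots \cup (C \cap D_k)\right)^c.$$


Using equations (2.17) and (2.16),


$$C^c \cup (D_1^c \cap \dots \cap D_k^c) = (C \cap D_1)^c \cap \dots \cap (C \cap D_k)^c.$$


Using equation (2.17),
$$C^c \cup (D_1^c \cap \dots \cap D_k^c) = (C^c \cup D_1^c) \cap \dots \cap (C^c \cup D_k^c). \tag{2.24}$$


Let $C = B^c$, and let $D_j = A_j^c$. By equation (2.15), $C^c = B$ and $(D_j)^c = A_j$.
Thus equation (2.24) gives equation (2.23).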

Incidentally, using equation (2.15) is correct, but we could express
things in another way: since C can be any set, $C^c$ can be any set, and since
$D_j$ can be any set, $D_j^c$ can be any set. And so, in equation (2.24) we can
replace $C^c$ by any B and $D_j^c$ by any $A_j$.


Chapter 3


Models with continuous sample spaces

Probability models come in many forms, and the theory of probability applies 
to all of them. Having a wide range of examples deepens our understanding 
of general properties. In this chapter we discuss models that have continuous 
sample spaces. These will be used in applications later.

Our discussion of the general principles of probability resumes in Chapter 4.
Impatient readers can read through the present chapter quickly, and
return later as needed. Exercises 3.1 and 3.5 will be useful in seeing
the main ideas here.


3.1 Choosing a point in a continuous interval 


In this section we introduce a new class of sample space models. These 
models are more abstract than the simple models described in Section 
but the general properties of probability remain true. 

The particular model we discuss here has a sample space which is made 
up of an infinite number of points. And not just that: the sample space 
forms a continuous interval, meaning an interval with no gaps. 

Consider the physical experiment of choosing a location at random on 
a yardstick. Since a yardstick is three feet long, one might represent the 
yardstick as the interval [0,3] of the real line. We can then think of the 
experiment more abstractly as choosing a point in the interval [0,3]. The 
outcome is the point chosen, and the sample space $\Omega$ is simply the interval
[0, 3] itself.


A point in $\Omega$ is a real number, so evidently we have chosen to represent a
physical point on the yardstick as a real number. However, specifying a real
number means specifying an infinite number of decimal digits! For an actual
physical location, this is very wasteful, since an infinitely precise description
of position has no experimental meaning. Thus for a sample point $\omega$ in [0, 3]
it seems that we can only use the first few digits from the decimal expansion
of $\omega$, and the number $\omega$ is a misleadingly precise description of a physical
location.


But this rather vague interpretation of $\omega$ seems acceptable for practical
purposes. We know that a mathematical interval is an extreme idealization, 
and no one could expect that [0,3] would match up perfectly with a real 
physical yardstick. 


Still though, since we acknowledge the imprecision in the interpretation, 
a reader may suspect that we are cluttering up the sample space with a lot 
of irrelevant details. Can’t we find a simpler model? 


As an alternative sample space, if we are satisfied by specifying positions
with an accuracy of six decimal places, we could agree to conceptually divide
the yardstick into subintervals of length $10^{-6}$, and just let the sample space
$\Omega$ be a set consisting of integers that label these sections. That sample space
would be less complicated mathematically, and we could adequately describe
any location by simply stating which subinterval contains it.
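To make the alternative concrete, here is a minimal Python sketch of such a labeling. The function name `cell_label`, the convention of numbering the subintervals from 0, and the decision to assign the right endpoint 3 to the last subinterval are all assumptions made for this illustration; the text does not specify them.

```python
import math

LENGTH_FT = 3                 # the yardstick is represented by [0, 3]
CELLS_PER_FOOT = 10 ** 6      # subintervals of length 10**-6
N_CELLS = LENGTH_FT * CELLS_PER_FOOT

def cell_label(x):
    """Return the integer label of the subinterval containing position x.

    Labels run from 0 to N_CELLS - 1 (an assumed convention). The right
    endpoint x = 3 is placed in the last subinterval.
    """
    if not 0 <= x <= LENGTH_FT:
        raise ValueError("position must lie in [0, 3]")
    return min(math.floor(x * CELLS_PER_FOOT), N_CELLS - 1)

print(N_CELLS, cell_label(0.0), cell_label(1.5), cell_label(3.0))
```

Even this small sketch has to settle what happens at the endpoints of subintervals, a detail that does not arise when the sample space is the interval itself.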


However, notice that using the interval [0, 3] as the sample space preserves 
much more of the geometrical setting for the experiment. And we will find 
that meaningful calculations of probabilities are actually clearer and more 
elegant if we use real numbers as sample points. So we will stick with using 
an interval of the real line as the sample space for this experiment, and for 
similar situations. 


Does that choice seem strange? It actually should not come as a great 
surprise that using a continuous interval of real numbers can make life easier, 
when modeling the physical world. Readers have likely already experienced 
the benefits of using the real line in calculus, to help solve problems about 
physical objects and physical processes. 


Now let’s think about how to assign probabilities to events, when the
sample space is a continuous interval of the real line.


3.2 Probabilities of subsets of an interval


Since we are going to be talking quite a bit about intervals, let’s state a 
definition, just to make sure we all agree on what we are talking about. 


Definition 3.1 (Intervals). An interval of the real line is defined as a subset
of the form [a, b], or [a, b), or (a, b], or (a, b). A one-point set {b} = [b, b]
counts as an interval too.


As a warmup to defining probabilities, let’s look at what doesn’t work.
What doesn’t work is the equation in Theorem 2.16. The problem is that
Theorem 2.16 dealt with the probability of an event A which is composed of
a finite number of points. Until now, that has been the only situation that
we have had to deal with, and it’s a nice situation, because the additivity of
probability basically tells us everything we need.

Theorem 2.16 says that when there are only a finite number of sample
points in A, then

$$P(A) = \sum_{\omega \in A} P(\{\omega\}). \tag{3.1}$$

If you have a probability model with a finite sample space, every event is
a finite set of points, so we can define the probability of all possible events by
figuring the appropriate value of $P(\{\omega\})$ for each sample point $\omega$. Equation (3.1)
then gives you the probability of any event A. Very convenient.

But now we are in a new world. When the sample space is an interval of 
the real line, the sample space certainly contains an infinite number of points. 
Furthermore, for any single sample point ω, the one-point set {ω} seems 
rather useless all by itself when modeling the choice of a random location, 
since the idea of specifying a physical position with infinite precision is a 
fantasy. And if A is any set which contains only a finite number of points, 
the same argument suggests that A is not going to help in modeling real 
events either. So we cannot avoid dealing with events which are infinite sets 
of points. 

After thinking about it, it seems that when choosing a random point in 
an interval, the most useful events will be subintervals. If A = [u,v], then A 
is the event that the chosen point lies somewhere in the interval [u,v]. This 
event seems physically meaningful, at least if the length of [u,v] is not too 
small to measure. 

So now we have a specific question to think about. Given a subset Ω of 
the real line, and modeling the random selection of a point from Ω, how can 
we define a probability distribution for intervals which are contained in Ω? 


3.3 The uniform probability distribution on 
an interval 


We are considering randomly choosing a point from a subset Ω of the real 
line. We take the sample space to be Ω, so that any point x in Ω represents 
the outcome that x is the chosen point. Events are sets of outcomes, so 
events are subsets of Ω. 

Consider a special case. Let’s add the assumption that the random point 
will be chosen from Ω in such a way that no point of Ω is favored. We’ll 
assume that Ω contains an interval which has positive length. 

How should we define probabilities in this case? 

Physically, it seems clear that a long interval is more likely to contain 
that chosen point than is a short one. Building on that insight, it seems 
reasonable to make a specific mathematical assumption: 


The probability that a point lies in a subinterval A of Ω 
should be proportional to the length of A. 


This means that there is some constant c such that for any subinterval A of 
Ω, 

\[ P(A) = c \,\mathrm{length}(A). \tag{3.2} \]


Definition 3.2 (Uniform distributions on subsets of the real line). 
Let Ω be a subset of the real line which is an interval, or the union of a finite 
number of intervals. 

Let P be a probability distribution on Ω such that for some constant c, 
equation (3.2) holds for every subinterval A of Ω. 

Then we say that the probability distribution P is the uniform distribution 
on Ω. 


When using Definition 3.2, how do we find the constant c in equation (3.2)? 

For simplicity, let’s assume that Ω is an interval. Then Ω is a subinterval 
of itself. We also must have P(Ω) = 1, so equation (3.2) tells us that 

\[ c \,\mathrm{length}(\Omega) = 1, \]

i.e. 

\[ c = \frac{1}{\mathrm{length}(\Omega)}. \tag{3.3} \]

We conclude that when A is a subinterval of Ω, 

\[ P(A) = \frac{\mathrm{length}(A)}{\mathrm{length}(\Omega)}. \tag{3.4} \]
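
To see equation (3.4) in action, here is a minimal Python sketch. The function names interval_length and uniform_prob, and the sample space [0, 3], are assumptions chosen for illustration; they are not notation used elsewhere in this book. The sketch only handles the case where the sample space is a single interval.

```python
def interval_length(a, b):
    """Length of an interval with endpoints a <= b (open or closed ends give the same length)."""
    if b < a:
        raise ValueError("expected a <= b")
    return b - a


def uniform_prob(sub, omega):
    """P(A) = length(A) / length(Omega), for a subinterval A of an interval Omega."""
    (u, v), (lo, hi) = sub, omega
    if interval_length(lo, hi) == 0:
        raise ValueError("Omega must have positive length")
    if not (lo <= u <= v <= hi):
        raise ValueError("A must be a subinterval of Omega")
    return interval_length(u, v) / interval_length(lo, hi)


omega = (0.0, 3.0)                           # hypothetical sample space [0, 3]
p_middle = uniform_prob((1.0, 2.0), omega)   # an interval of length 1 inside length 3
p_point = uniform_prob((1.5, 1.5), omega)    # the one-point interval [1.5, 1.5]
print(p_middle, p_point)
```

Notice that a one-point interval has length zero, so the same formula assigns it probability zero.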


Exercise 3.1. When Definition 3.2 holds, and Ω is not a single interval, 
but is the union of several disjoint intervals B_1, ..., B_k, what is the correct 
formula for P(A), instead of equation (3.4)? 


In an experiment, if you want to describe properties of positions, using 
a ruler or some other measuring device, you will likely describe one or more 
subintervals which are ranges specified by your measurements. Thus a typical 
event when choosing a random point seems likely to be a finite union 
of intervals. If A denotes such an event, then there are disjoint intervals 
I_1, ..., I_n such that 

\[ A = I_1 \cup \cdots \cup I_n. \tag{3.5} \]

See Figure 3.1. Mathematically, other events are certainly possible, but we 
don’t need to consider these at the moment. 


Remark 3.3 (One-point events never happen!). In the situation of 
Definition 3.2, let ω be a point of Ω, and let A = {ω}. 
Since A = [ω,ω], by formula (3.4) we have 

\[ P(A) = \frac{\mathrm{length}([\omega,\omega])}{\mathrm{length}(\Omega)} = 0. \]

Event A never happens! 


Figure 3.1: A = I_1 ∪ I_2 ∪ I_3 ∪ I_4. 


On the other hand, every time you perform the experiment, some location 
is chosen. So some one-point event always happens! 

This sounds like a nasty paradox. To see that there is really no problem, 
let’s look at an analogous story about length. 

Every one-point set has zero length, right? So the unit interval is made 
up of sets which have zero length. But the good old unit interval has length 
equal to one. 

Is that a paradox? Well, the length of the unit interval is definitely not 
found by adding up the lengths of the points that compose it. So the length 
story seems ok. 

Let’s go back to the experiment of choosing a point. Let c denote a 
particular point. Suppose that you randomly choose a point every second of 
every day for a trillion years. What’s the probability that the point c will be 
one of the points that is chosen during that period? The correct answer is 
zero, right? 

We should keep in mind that the real line is an abstraction, and points 
of the real line are not physical objects. We can often think about models as 
if points of the real line are physical objects, but it’s not so. 


Incidentally, when one-point events have zero probability, the probability 
of an interval does not depend on whether or not we include the endpoints. 
That is, when choosing a random point, 

\[ P([a,b]) = P([a,b)) = P((a,b]) = P((a,b)). \tag{3.6} \]

Why is that true? By additivity, P([a,b]) = P({a}) + P((a,b)) + P({b}) = P((a,b)), 
since the one-point events {a} and {b} have probability zero. The half-open 
intervals are handled in the same way. 


Exercise 3.2. A certain factory makes special telephone cables. Occasion- 
ally defects occur in the pieces of cable which are produced. The defects are 
rare, and seem equally likely to occur at any point in a cable. 

There are four communication centers, A, B, C, D. They are connected 
using cables of the sort just described. One cable runs from A to B, another 
from B to C, and a final cable runs from C to D. Cable AB is 3 miles long, 
cable BC is 4 miles long, and cable CD is 2 miles long. 

After the cables are installed, the staff discovers that a signal is unable to 
pass from A to D via the three cables. However, a signal passes successfully 
from B to C. 

Assuming that there is only one defect in the three cables, the defect 
must lie in either the cable from A to B or in the cable from C to D. Find 
the probability that the cable from C to D is the one with the defect. 


Exercise 3.3. A certain street is 600 feet long. Sam lost his lucky penny 
somewhere along this street. He knows he lost it there, but has no idea in 
what part of the street it has fallen. 

His friends Alice, Bob and Clancy decide to search for Sam’s coin. 
Alice searches the first 300 feet, Bob searches the next 200 feet, and Clancy 
searches the final 100 feet. The searchers are careful, so they will not miss 
the coin. 


(i) Let A be the event that the coin is located in the interval that Alice is 
searching. Find P(A). 

(ii) Suppose that we learn the following additional information. After five 
minutes, the coin has not yet been found. Alice has already searched 
two-thirds of her section. Bob has searched half of his section, and 
Clancy has searched three-quarters of his section. 


Let A be the event that Alice eventually finds the coin. Based on all 
the information we now have, find the probability of A. 


This part of the problem is an example of a conditional probability 
calculation, and we have not yet covered conditional probability. How- 
ever, you can solve this problem by building a new model, with a new 
sample space. 


3.4 Probability densities on intervals 


Let Ω be an interval of the real line, or perhaps a finite union of intervals. 

We will think of Ω as part of a model for the experiment of choosing a 
random point. Of course we have to have a probability distribution defined 
too. So far we have talked about uniform distributions. But that might not 
match the physical conditions of the experiment. It might be that the point 
is more likely to be chosen from one region rather than another. 

A probability distribution which is not uniform can be represented by 
using a probability density function which is larger in some regions and smaller 
in others. 

As the name suggests, a probability density which is defined on a portion 
of a line tells us the “probability per unit length”. 


Definition 3.4 (Probability densities). A probability density f is a function 
such that 


(i) f is nonnegative, and 
(ii) the integral of f over Ω is equal to one. 


If P is a probability distribution such that 

\[ P([a,b]) = \int_a^b f(x)\,dx \tag{3.7} \]

for every interval [a,b] which is contained in Ω, then we say that f is a 
probability density for P. 


In calculus, $\int_a^b f(x)\,dx$ is usually referred to as “the integral of f over the 
set [a,b]”. Equation (3.7) says that the probability of [a,b] is given by the 
integral of the probability density over the set [a,b]. 

Conditions (i) and (ii) in Definition 3.4 are needed because probabilities 
are nonnegative, and because we must have P(Ω) = 1. 

The integral of a function over a single point is of course equal to zero, so 
when a probability distribution has a density, equation (3.6) holds. So when 
there is a density we don’t need to be fussy about endpoints of intervals: 

\[ P([a,b]) = P([a,b)) = P((a,b]) = P((a,b)). \tag{3.8} \]
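
If you like to experiment on a computer, the following Python sketch shows Definition 3.4 at work. Everything specific in it is an assumption made for illustration: the density f(x) = 2x on the sample space [0, 1], the helper name integrate, and the use of the midpoint rule as a stand-in for the exact integral.

```python
def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def density(x):
    # Hypothetical density on the sample space [0, 1]: f(x) = 2x.
    return 2.0 * x


# Condition (ii) of Definition 3.4: the density integrates to one over the sample space.
total = integrate(density, 0.0, 1.0)

# Equation (3.7): the probability of [a, b] is the integral of the density over [a, b].
p_right_half = integrate(density, 0.5, 1.0)

print(total, p_right_half)
```

With this density, points in the right half of [0, 1] are three times as likely to be chosen as points in the left half, even though the two halves have the same length.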


Figure 3.2: The probability of choosing from a set is the integral of the 
density over the set. 


Exercise 3.4. Let Ω = [0,3]. Let f be a probability density on Ω which is a 
multiple of e^{-x}. A point is chosen at random from [0,3], with a probability 
distribution given by the density f. Find the probability that the point is 
chosen from the interval [1,2]. See Figure 3.2. 


Exercise 3.5 (Constant densities give uniform distributions). Let Ω 
be a subset of the real line which is the union of a finite number of disjoint 
intervals. Let f be a constant function on Ω such that the integral of f over 
Ω is equal to one. 

Let P be the distribution with density f. Prove that P is the uniform 
distribution on Ω, in the sense of Definition 3.2. 


A general form of Definition 3.4 will be given in a later definition. The 
general definition applies to a wide range of sample spaces, not just the real 
line. 


3.5 Cleaning up integral notations 


Formulas are easier to understand if we simplify the notation. 

For example, in equation (3.7), unless we want to show the formula for f 
explicitly, there is really no need to write f(x) dx in the traditional calculus 
manner. It is clearer to just write 

\[ P([a,b]) = \int_a^b f. \tag{3.9} \]

Even this notation can be improved. The best notation for the integral of f 
over a general set A is: 

\[ \text{integral of } f \text{ over } A = \int_A f. \tag{3.10} \]

Thus equation (3.9) becomes 

\[ P([a,b]) = \int_{[a,b]} f. \tag{3.11} \]

Actually, if equation (3.11) holds for all intervals [a,b], then for any event 
A it is true that 

\[ P(A) = \int_A f. \tag{3.12} \]


This form is convenient, but what does it mean, when A is not an interval? 

Presumably we know what the event A means, or we wouldn’t be talking 
about it. And so we know what P(A) means. But what about $\int_A f$? 

The concept of integrating a function over a set actually makes sense for 
lots of sets, not just sets which are intervals. 

To see this physically, think about a wire whose mass density might vary 
along the wire. The mass of any part of the wire is found by integrating 
the mass density function over that part of the wire. This makes sense even 
if the part of the wire that you are interested in consists of many separate 
pieces. Just find the mass of each piece (by integrating the mass density) 
and then add up the masses. 

In calculus this is how we can find the integral of a function f over a set 
A, if A is the union of two disjoint intervals [a,b] and [c,d]: 

\[ \int_A f = \int_a^b f + \int_c^d f. \tag{3.13} \]


For example, if we have to integrate a function which has a different formula 
on different parts of an interval A, we often calculate the integral over A as 
the sum of the integrals over the separate parts. 

For general sets we can do the same thing. We can find the integral over 
a set by integrating over the pieces of the set, and then adding up the results. 
We will call this the additive property for integration over a set. 

And notice that the additivity of integration over sets is exactly what we 
need if equation (3.12) is used to define a probability distribution. After all, 
if A = D_1 ∪ D_2, where D_1, D_2 are disjoint events, the additivity of probability 
says we must have P(A) = P(D_1) + P(D_2). That probability equation can 
only be true if: 

\[ \int_A f = \int_{D_1} f + \int_{D_2} f. \tag{3.14} \]

And that’s additivity for integrating over a set. 

So we know how densities define probabilities, and we know how integration 
over a set works. That’s all we need to understand probability densities. 
But let’s nail this down by giving a nice general definition of the process of 
integrating over a set. 

Our definition should capture the idea that the integral of f over a set A 
is the integral using the values of f on the set A, and nothing else. Using that 
idea, it is pleasantly simple to give a general definition of $\int_A f$, as follows. 


Definition 3.5 (Integration over a set). Let f be a function and let A 
be a set. 
Define a new function g as follows: 

\[ g(x) = \begin{cases} f(x) & \text{if } x \in A, \\ 0 & \text{otherwise.} \end{cases} \tag{3.15} \]

Then by definition 

\[ \int_A f = \int g. \tag{3.16} \]


We might say that g in equation (3.15) is formed by discarding the values 
of f on A^c. 

Does the additive property follow from Definition 3.5? Sure, and it’s a pleasant 
exercise. See Example 11.4 for a nice way to write down the argument. 

Incidentally, when it comes to actually computing an integral over a set, 
which notation you use won’t make much difference. But the modern $\int_A f$ 
notation seems clearer for thinking. 
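
The idea of discarding the values of f outside A translates directly into code. The sketch below is only an illustration under stated assumptions: the function f(x) = x^2, the set A = [0,1] ∪ [2,3], the helper names integrate and restrict, and the midpoint rule are all choices made here, not part of the definition. It integrates the discarded-values function over a single interval containing A, and compares the result with the sum of the integrals over the two pieces.

```python
def integrate(f, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def restrict(f, intervals):
    """Return g with g(x) = f(x) if x lies in one of the intervals, and g(x) = 0 otherwise."""
    def g(x):
        return f(x) if any(lo <= x <= hi for lo, hi in intervals) else 0.0
    return g


def f(x):
    # Hypothetical function, chosen only for illustration.
    return x * x


A = [(0.0, 1.0), (2.0, 3.0)]       # A is the union of two disjoint intervals
g = restrict(f, A)

# Integrate g over an interval containing all of A (g is zero elsewhere).
via_definition = integrate(g, -1.0, 4.0)

# Integrate f over each piece of A and add up the results.
via_additivity = integrate(f, 0.0, 1.0) + integrate(f, 2.0, 3.0)

print(via_definition, via_additivity)
```

The two numbers agree up to the small error of the numerical method, which is the additive property in a concrete case.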


Remark 3.6 (Intervals characterize densities). Let Ω be an interval of 
the real line, or perhaps a finite union of intervals. 

Suppose you are studying a distribution P on Ω, and you come up with 
a function f such that $P([s,t]) = \int_{[s,t]} f$ for every subinterval [s,t] of the 
sample space. Then by definition f is a valid density for P. Does that mean 
that for any event A you can go ahead and calculate P(A) using $P(A) = \int_A f$? 
One would hope so, and happily that is actually true! Very convenient. We 
won’t write down a formal proof, but it illustrates a general principle: there 
are lots of intervals, and knowing that an equation is true for intervals is 
good evidence that it holds in general. 

We’ll return to this subject in Remark 9.11. 

More examples of densities on the real line are given in Section 3.7. 


3.6 Choosing a point in R²: throwing darts 




Figure 3.3: An event on the dart board 


Consider the experiment of throwing darts at a dart board. Assume that 
the thrower is rather inaccurate, so the point where the dart hits is random. 
(If the thrown dart misses the target completely, we will ignore that throw, 
and consider that the experiment did not occur.) 

The outcome of the experiment is the point of impact, i.e. the location 
at which the dart hits the board. We can represent this outcome as a point 
on an idealized copy of the dart board, which we take to be a region called 
T in R², where R² is the set of all coordinates (x_1, x_2) in the plane, i.e. R² 
is the set of all pairs of real numbers. 

The dart board region T is our sample space Ω. An event is then simply 
a sub-region of the dart board region. See Figure 3.3 for a picture of Ω and 
an event A. 

If the thrower under consideration is very inaccurate, for simplicity we 
might assume that every part of the target has the same chance of being 
hit. In that case it seems natural to assume that the probability of hitting 
a particular region on the target is proportional to the area of the region. 
Since the probability of hitting the whole target must be 1 (remember that 
we disregard any throw that hits elsewhere), this means that the probability 
of hitting some region A of the dart board is given by: 

\[ P(A) = \frac{\mathrm{area}(A)}{\mathrm{area}(\Omega)}. \tag{3.17} \]

Like the formula for subsets of the real line which was given in equation (3.4), 
equation (3.17) is a continuous analog of Theorem 2.23. 


Definition 3.7 (Uniform distribution on a region in the plane). If a 
probability set-function P is defined on subsets of a region T of the plane, 
and is such that probability is proportional to area, it will be said to be a 
uniform probability distribution on T. 


Example 3.8 (Probability of missing the central region). Someone is 
throwing darts at a target represented by a disc of radius 5, centered at the 
origin of R². 

The point of impact (x,y) is random, with a uniform distribution on the 
target. 

Let A be the set of points (x,y) in the target such that $\sqrt{x^2 + y^2} > 2$. 

Since $\sqrt{x^2 + y^2} = r$, the distance of the point (x,y) from the origin, A 
represents the physical event that the dart lands more than two units of 
distance from the center. See Figure 3.4. 

Let us find P(A). 

Since the probability distribution is uniform, 

\[ P(A) = \frac{\mathrm{area}(A)}{\mathrm{area}(T)}, \]

where T = Ω is the target region, and of course area(A) = area(T) - 
area(A^c) = 25π - 4π = 21π. Thus 

\[ P(A) = \frac{21\pi}{25\pi} = \frac{21}{25}. \]


Figure 3.4: A is the event that the dart misses the center region 
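
You can check the answer 21/25 by simulating an inaccurate dart thrower. The Python sketch below is one way to do it, under assumptions made here for illustration: the function name estimate_miss_center, the number of trials, and the seed are arbitrary, and the uniform distribution on the disc is produced by choosing points uniformly in the surrounding square and ignoring those that miss the target, just as the text ignores throws that miss the board.

```python
import random


def estimate_miss_center(trials=200_000, seed=0):
    """Estimate P(A) for Example 3.8 by simulating uniform throws at a disc of radius 5."""
    rng = random.Random(seed)
    hits = 0    # throws that land on the target
    in_a = 0    # throws that land more than 2 units from the center
    while hits < trials:
        x = rng.uniform(-5.0, 5.0)
        y = rng.uniform(-5.0, 5.0)
        r_squared = x * x + y * y
        if r_squared > 25.0:
            continue            # the dart missed the target: ignore this throw
        hits += 1
        if r_squared > 4.0:
            in_a += 1
    return in_a / hits


estimate = estimate_miss_center()
exact = 21 / 25
print(estimate, exact)
```

The estimate is a relative frequency, so it will not equal 21/25 exactly, but with this many throws it should be close.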


The next two exercises are mostly a test to see if you can still calculate 
areas. It’s ok to skip them, as long as the statements of the questions make 
sense to you. 


Exercise 3.6. Let Ω be the rectangle consisting of all points (x,y) such that 
0 ≤ x ≤ 2 and 0 ≤ y ≤ 5. Let P be the uniform probability distribution on 
Ω, so that this sample space and distribution form a model for choosing a 
point at random from Ω. 


(i) Let A be the event that the chosen point (x,y) is such that x < y. Find 
P(A). 


(ii) Let B be the event that y < 4 - 2x. Find P(A ∩ B). 


Exercise 3.7. Someone is throwing darts at a target represented by the unit 
disc. The point of impact (x,y) is random, with a uniform distribution on 
the target. 
Let A be an event defined in terms of the height of the point of impact: 
A is represented by the set of points (x,y) in the target such that 
-1/√2 ≤ y ≤ 1/√2. 
Find P(A). 


Just as in the case of an interval, we can easily define probability densities 
for regions in the plane. Two-dimensional integrals take more calculation 
than one-dimensional integrals, but the ideas are the same. 


3.7 More examples of densities 


Readers likely don’t have an urgent need for more examples at the moment. 
But it may be enlightening to glance over the examples here, and return 
later, when using densities in later chapters. 


Example 3.9 (A uniform density on an infinite interval?). Much as 
in an earlier example, consider a constant probability density f on an infinite 
interval, like [0,∞) for example. Does such a density make sense? 

Let k be the constant value of f. Then k is a nonnegative number since f is 
a density. Since $\int_0^\infty f = \int_\Omega f = P(\Omega) = 1$, k cannot be zero. On the other 
hand, since 

\[ 1 \ge P([0,n]) = \int_{[0,n]} f = nk \]

for every n, we have k ≤ 1/n for every n, and we are forced to conclude that 
k must be zero. This contradiction shows that a constant probability density 
on an infinite interval does not exist. 


A nonconstant probability density on an infinite interval is certainly possible, 
as the next exercise illustrates. 


Figure 3.5: f(x) = c e^{-λx} 


Exercise 3.8 (The exponential density). Let λ be a positive constant. 
Let f(x) = c e^{-λx} for x ≥ 0, f(x) = 0 otherwise. Assuming f is a probability 
density on R, find c. 

See Figure 3.5. 


Exercise 3.9. Let α and β be positive constants. Let f(x) = c e^{-αx} for 
x ≥ 0, f(x) = c e^{βx} for x < 0. Assuming f is a probability density on R, find 
c. 


See Figure 3.6. 


Figure 3.6: α = .8, β = 1.3 


Exercise 3.10. For the experiment of choosing a number from the interval 
[0,3], suppose that points near 0 are more likely to be chosen, specifically that 
the probability set-function is given by a density f of the form f(x) = c(3 - x), 
where c is some constant. 


(i) Find c. 


(ii) Calculate the probability that the selected number is less than 1/2. 

See Figure 3.7. 


Exercise 3.11. Consider a probability model with sample space Ω equal to 
[0,4] and probability density f(t) = t/8. 


(i) Check that f is a probability density. 

(ii) Let P be the probability set-function with density f. Suppose that a 
random number t is selected. Let A be the event that (t - 1)² > 2. 
Find P(A). 

See Figure 3.8. 


Figure 3.7: The probability of choosing from a set is the integral of the 
density over the set. 


Exercise 3.12. Consider the probability model in Exercise 3.11. Using the 
P with density f, let t be the randomly selected point. Let A be the event 
that 2t - 2 < (t - 1)². Find P(A). 

See Figure 3.9. 

(You finally get to use equation (3.12) in a situation where A is not an 
interval!) 


Remark 3.10 (To what extent is the probability density unique?). 
The purpose of a density f is to define a probability set-function P. 

If one just changes f at a few points, it makes no difference in any integral 
involving f, and so it makes no difference in the definition of P. So such a 
modified function works, i.e. is a correct probability density for the probability 
set-function P. It is just as valid as the original function f, even though 
it may have a more cumbersome definition. 

We might think about the density f as a kind of probability machine. 
One turns the crank on this machine (i.e. integrates f) to get a probability. 
That is the sole purpose of f, its raison d’être. Any other function h such 
that $\int_A h = \int_A f$ for all events A also deserves the honor of being called a 
probability density for P. 


Figure 3.8: The probability of choosing from a set is the integral of the 
density over the set. 


Figure 3.9: The probability of choosing from a set is the integral of the 
density over the set. 


3.8 Solutions for Chapter 3 


Solution (Exercise 3.1). Since Ω = B_1 ∪ ... ∪ B_k, and the sets B_1, ..., B_k 
are disjoint, we know by additivity that 

\[ P(\Omega) = P(B_1) + \cdots + P(B_k). \]

But P(Ω) = 1, and by equation (3.2) we know that P(B_i) = c length(B_i). 
Thus 

\[ 1 = c\,\mathrm{length}(B_1) + \cdots + c\,\mathrm{length}(B_k), \]

so 

\[ c = \frac{1}{\mathrm{length}(B_1) + \cdots + \mathrm{length}(B_k)}. \]

Replacing c in equation (3.2) by this value gives the correct form of 
equation (3.4): 

\[ P(A) = \frac{\mathrm{length}(A)}{\mathrm{length}(B_1) + \cdots + \mathrm{length}(B_k)}. \tag{3.18} \]


Remark 3.11. You could extend the definition of length to include sets 
which are not intervals. So in the situation of this problem, when Ω is the 
union of disjoint intervals B_1, ..., B_k, we could agree to define 

\[ \mathrm{length}(\Omega) = \mathrm{length}(B_1) + \cdots + \mathrm{length}(B_k). \]

Then equation (3.4) would still be valid. 


Solution (Exercise 3.2). Choose numbers a, b, c, d such that the length of 
[a,b] is equal to the length of the cable from A to B, and the length of [c,d] 
is equal to the length of the cable from C to D. Choose the numbers so that 
the intervals [a,b] and [c,d] are also disjoint. 

Let Ω = [a,b] ∪ [c,d]. 

We can think of a point in one of the intervals [a,b], [c,d] as a position 
coordinate which describes the possible location of the cable defect. 

Let P be the uniform probability distribution on Ω. 

Let H represent the event that the defect lies in the cable from C to D. 
Then H = [c,d]. 

By equation (3.18), 

\[ P(H) = \frac{\mathrm{length}(H)}{\mathrm{length}([a,b]) + \mathrm{length}([c,d])} = \frac{d-c}{(b-a)+(d-c)} = \frac{2}{3+2} = \frac{2}{5}. \]


Solution (Exercise 3.3). Let Ω be the union of disjoint intervals U, V, W, 
where length(U) = 300, length(V) = 200, and length(W) = 100. 
We can think of a point in one of the intervals U, V, W as a position 
coordinate which describes the possible location of the lost penny. 
Let P be the uniform distribution on Ω. 


(i) The abstract event representing A is U. 

\[ P(U) = \frac{\mathrm{length}(U)}{\mathrm{length}(\Omega)} = \frac{300}{600} = \frac{1}{2}. \]


(ii) Let U', V', W' be the unsearched parts of U, V, W, respectively. We will 
assume that these unsearched parts are intervals. (If the unsearched parts 
were made up of many pieces, the same method would work, it would just 
take longer to write down.) 

Then U', V', W' are intervals with length(U') = (1/3) length(U) = 100, 
length(V') = (1/2) length(V) = 100, length(W') = (1/4) length(W) = 25. 

Now let the sample space be Ω' = U' ∪ V' ∪ W'. This represents the 
unsearched road. 

A point in Ω' represents the possible position of the missing coin, in the 
unsearched road. 

The original description of the problem gives no reason to treat any section 
of the whole road differently from any other section. The decision about 
where to search for the coin does not seem connected in any way to the actual 
location of the coin. So it seems that there is still no reason to treat any 
section of the unsearched road differently from any other section. 

So the appropriate distribution for the location of the missing coin is the 
uniform distribution on Ω'. Let P be the uniform distribution on Ω'. 

In this model, the event that Alice eventually finds the coin is U'. Using 
equation (3.18), 

\[ P(U') = \frac{\mathrm{length}(U')}{\mathrm{length}(U') + \mathrm{length}(V') + \mathrm{length}(W')} = \frac{100}{100+100+25} = \frac{4}{9}. \]


Solution (Exercise 3.4). The requested probability is P([1,2]), where the 
density for P is given by c e^{-x} on [0,3], for some constant c. 


Since P(Ω) = 1, 

\[ \int_0^3 c e^{-x}\,dx = 1, \]

so 

\[ 1 = -c e^{-x} \Big|_0^3 = c(1 - e^{-3}). \]

Thus 

\[ c = \frac{1}{1 - e^{-3}}. \]

Thus 

\[ P([1,2]) = \int_1^2 c e^{-x}\,dx = -c e^{-x} \Big|_1^2 = c(e^{-1} - e^{-2}) = \frac{e^{-1} - e^{-2}}{1 - e^{-3}}. \]
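
If you want to double-check a calculation like this one, a numerical integration is a reasonable sanity test. The sketch below assumes a simple midpoint rule (the helper name integrate is a choice made here, and any numerical integration method would do) and compares the numerical value of the integral with the closed form found above.

```python
import math


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


c = 1.0 / (1.0 - math.exp(-3.0))     # the constant found in the solution


def f(x):
    # The density of Exercise 3.4 on [0, 3].
    return c * math.exp(-x)


total = integrate(f, 0.0, 3.0)       # should be close to 1
numeric = integrate(f, 1.0, 2.0)     # numerical value of P([1, 2])
closed_form = (math.exp(-1.0) - math.exp(-2.0)) / (1.0 - math.exp(-3.0))
print(total, numeric, closed_form)
```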


Solution (Exercise 3.5). We are told that f is a constant function. Let c 
be the value of f. 

Let A be any subinterval of Ω. Then 

\[ \int_A f = \int_A c. \]

From calculus we know that integrating c over an interval A gives c length(A). 
We are given that f is a probability density for P. Thus $P(A) = \int_A f$. 
We have shown that P(A) = c length(A) for every subinterval A. 

By Definition 3.2, P is the uniform probability distribution on Ω. 


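Solution (Exercise 3.6). (i) The set A^c = {(x,y) : y ≤ x} ∩ Ω is a 
triangle with base 2 and altitude 2. Hence its area is (1/2)4 = 2. The set 
Ω is a 2 × 5 rectangle, so its area is 10. Thus P(A^c) = 2/10 = 1/5, and so 
P(A) = 4/5. 


(ii) The line y = 4 - 2x crosses the line y = x at the point (4/3, 4/3). 

Let’s find the area of A^c ∩ B, the part of B which lies on or below the line 
y = x. The integral of a function of the form 
mx + b over an interval is equal to the length of the interval times the value 
of the function at the midpoint. Thus 

\[ \mathrm{area}(A^c \cap B) = \int_{4/3}^{2} (4 - 2x)\,dx + \int_0^{4/3} x\,dx = \frac{2}{3}\Big(4 - 2\cdot\frac{5}{3}\Big) + \frac{4}{3}\cdot\frac{2}{3} = \frac{4}{3}. \]

The set B is a triangle with vertices (0,0), (2,0) and (0,4), so area(B) = 4, and 
area(A ∩ B) = area(B) - area(A^c ∩ B). Hence 

\[ P(A \cap B) = \frac{\mathrm{area}(A \cap B)}{\mathrm{area}(\Omega)} = \frac{1}{10}\Big(4 - \frac{4}{3}\Big) = \frac{8}{30} = \frac{4}{15}. \]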


Solution (Exercise 3.7). Let M_+ = {(x,y) : y > 1/√2} ∩ Ω, and let 
M_- = {(x,y) : y < -1/√2} ∩ Ω. 
Then A = Ω - (M_+ ∪ M_-), so area(A) = π - area(M_+) - area(M_-) = 
π - 2 area(M_+). 
Let S be the square with vertices (1/√2, 1/√2), (1/√2, -1/√2), (-1/√2, -1/√2), (-1/√2, 1/√2). 
This square has side √2, so its area is 2. 
Then Ω - S is the union of four regions, M_+ and three other regions with 
the same area as M_+. Hence 

\[ \pi - 2 = 4\,\mathrm{area}(M_+), \]

so 

\[ \mathrm{area}(M_+) = \frac{\pi - 2}{4}. \]

Thus 

\[ \mathrm{area}(A) = \pi - 2\,\mathrm{area}(M_+) = \pi - \frac{\pi - 2}{2} = \frac{\pi + 2}{2}. \]

Hence 

\[ P(A) = \frac{\mathrm{area}(A)}{\mathrm{area}(\Omega)} = \frac{(\pi + 2)/2}{\pi} = \frac{1}{2} + \frac{1}{\pi}. \]

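Solution (Exercise 3.8). Since 

\[ 1 = \int_\Omega f = \int_0^\infty c e^{-\lambda x}\,dx = -\frac{c}{\lambda} e^{-\lambda x} \Big|_0^\infty = \frac{c}{\lambda}, \]

we must have c = λ. 


Solution (Exercise 3.9). Since 

\[ 1 = \int_\Omega f = \int_0^\infty c e^{-\alpha x}\,dx + \int_{-\infty}^0 c e^{\beta x}\,dx = \frac{c}{\alpha} + \frac{c}{\beta}, \]

we must have 

\[ c = \frac{1}{\frac{1}{\alpha} + \frac{1}{\beta}} = \frac{\alpha\beta}{\alpha + \beta}. \]

Note that the answer here agrees with the answer to Exercise 3.8 if we 
set λ = α and let β → ∞. Why should that be the case? 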


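Solution (Exercise 3.10). (i) 

\[ 1 = P(\Omega) = \int_0^3 c(3 - x)\,dx = -\frac{c(3 - x)^2}{2} \Big|_0^3 = \frac{9c}{2}. \]

Thus c = 2/9. 

(ii) The probability is 

\[ \int_0^{1/2} f(x)\,dx = \frac{2}{9} \int_0^{1/2} (3 - x)\,dx = -\frac{1}{9}(3 - x)^2 \Big|_0^{1/2} = \frac{9 - (5/2)^2}{9} = \frac{11}{36}. \]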


Solution (Exercise 3.11). (i) f is clearly nonnegative on Ω, so we only 
need to check that $\int_\Omega f = 1$. 

\[ \int_\Omega f = \int_0^4 \frac{1}{8} t\,dt = \frac{1}{16} t^2 \Big|_0^4 = \frac{4^2}{16} = 1. \]


(ii) We need to write A more explicitly. 

So we must rewrite the inequality (t - 1)² > 2 in an appropriate form. 

If t ≥ 1, then t - 1 ≥ 0, and (t - 1)² > 2 is equivalent to t - 1 > √2, i.e. 
t > 1 + √2. 

If t < 1, then t - 1 < 0, so 1 - t > 0, and (t - 1)² > 2 is equivalent to 
1 - t > √2, i.e. t < 1 - √2. Since 1 - √2 is negative, no point of Ω satisfies 
this condition. 


Combining these statements with the fact that 0 ≤ t ≤ 4, we see that 

\[ A = (1 + \sqrt{2},\, 4]. \]

Thus 

\[ P(A) = \int_{1+\sqrt{2}}^{4} \frac{1}{8} t\,dt = \frac{t^2}{16} \Big|_{1+\sqrt{2}}^{4} = \frac{1}{16}\Big(16 - (1 + \sqrt{2})^2\Big) = \frac{1}{16}\Big(16 - (1 + 2\sqrt{2} + 2)\Big) = \frac{1}{16}\big(13 - 2\sqrt{2}\big). \]

Solution (Exercise 3.12). We could consider cases, as in the previous 
problem, but perhaps it’s faster to rewrite the inequality which defines A in 
a different way first. A is the set of t such that (t - 1)² - 2t + 2 > 0. 
This inequality says t² - 4t + 3 > 0, i.e. (t - 1)(t - 3) > 0. 
The polynomial (t - 1)(t - 3) is zero at t = 1 and t = 3, positive for t < 1, 
positive for t > 3, and negative otherwise. 
Hence A = [0,1) ∪ (3,4], and so 

\[ P(A) = \int_0^1 \frac{1}{8} t\,dt + \int_3^4 \frac{1}{8} t\,dt = \frac{1}{16}\Big( t^2 \Big|_0^1 + t^2 \Big|_3^4 \Big) = \frac{1}{16}(1 - 0 + 16 - 9) = \frac{8}{16} = \frac{1}{2}. \]


Conditional probability 


4.1 Conditional probability defined 


Consider any experimental situation, and any possible events A, B for this 
experiment. Suppose that, based on your knowledge of the experimental 
setup, you know the value of probabilities such as P(B) and P(A). 

Now suppose the experiment has been performed. Although you do not 
yet know the result, someone tells you that the event B did occur. This 
extra knowledge, combined with what you already knew, gives you a new 
experimental situation, and a new probability for the event A. We call this 
new probability “the conditional probability that A occurred given that 
B occurred”. This probability value is written as P(A| B). 

How do you find P(A| B)? 

As a simple example, we can think about the experiment of throwing 
darts at a dart board, described in Section 3.6. The throw takes place, but 
we are not looking. We ask someone a specific question: “Did the dart land in 
region B?” (See Figure 4.1.) The answer is “yes”, so we have one additional 
piece of information about the experiment. We do not have other additional 
information. 

The question is: what probability should we now assign to the event that 
the dart landed in region A? 

It should be emphasized that conditional probabilities are not different 
from any other probabilities. Every probability is conditional on some information! 
Mathematicians use the word “conditional” here merely to emphasize 
the way in which your knowledge has changed from what you started 
with. 


Figure 4.1: Events on the dart board 


We are currently thinking about physical probabilities for an actual experiment. 
There is a simple formula for the conditional probability. 


Fact 4.1 (The conditional probability formula). Let A and B be physical
events for some experiment. If $P(B) \neq 0$, then
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{4.1}
\]
or equivalently
\[
P(A \cap B) = P(B)\, P(A \mid B). \tag{4.2}
\]




If you construct a probability model for the experiment, you will define 
probabilities for the abstract events in your model. The probabilities P(A) 
in a valid model will be the appropriate probabilities based on your initial 
knowledge. In the case of a probability model the conditional probability 
formula holds by definition, but the definition must follow the physical rule. 


Definition 4.2 (The conditional probability formula). Let A and B be
events in a model for some experiment. If $P(B) \neq 0$, then $P(A \mid B)$ is defined
by equation (4.1).


Both equation (4.1) and equation (4.2) are useful forms of the conditional
probability formula. We might call equation (4.2) the “multiplied-through”
form of the conditional probability formula.

Much of our real-world knowledge about people or things can be regarded 
as approximate forms of conditional probability assessments! For example, 
if we are getting ready to deal with a person’s possible reactions to some 
situation that might occur, we may be using a thought process analogous 
to (4.2). B would represent the event that the situation occurs, and A would 
represent the event that the person reacts in a particular way. 


4.2 Why the conditional probability formula holds


Like additivity, the conditional probability formula is a fundamental rule in 
probability. In this section we will spend some time justifying this formula. 

To show that equation (4.1) is correct physically, think about events $A_1$
and $A_2$, which are subsets of B, and are such that $P(A_1) = P(A_2)$. This
implies that, based on everything you know initially, these two events are
equally likely.

If someone tells you that B occurred, does that extra piece of information
give you any reason to believe that one of the subsets $A_1$, $A_2$ is now more
likely than the other? It is hard to see how that could be the case. The extra
information does not treat either event differently. You may know many
ways in which events $A_1$ and $A_2$ differ from each other physically, but you
already knew that when you initially decided that $A_1$ and $A_2$ had the same
probability.

Based on such considerations, it seems plausible that for events A which
are subsets of B, $P(A \mid B)$ should be proportional to $P(A)$. That is, for some
constant c,
\[
P(A \mid B) = c\, P(A) \tag{4.3}
\]
for every event A which is a subset of B.

Applying equation (4.3) with A = B, we see that $P(B \mid B) = c\, P(B)$, and
of course $P(B \mid B) = 1$! Thus $c = 1/P(B)$, and so we have
\[
P(A \mid B) = \frac{P(A)}{P(B)} \tag{4.4}
\]
for every event A which is a subset of B.

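What about events A which are not subsets of B? Well, B and $B^c$ are
mutually exclusive, so $P(B^c \mid B) = 0$. For any event A, since $A \cap B^c \subseteq B^c$,
we know that $P(A \cap B^c \mid B) = 0$.

By additivity we have
\[
P(A \mid B) = P(A \cap B \mid B) + P(A \cap B^c \mid B),
\]
and so
\[
P(A \mid B) = P(A \cap B \mid B). \tag{4.5}
\]
By equation (4.4),
\[
P(A \cap B \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{4.6}
\]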


Applying equation (4.6) to equation (4.5) gives us equation (4.1), which is 
the general conditional probability formula. 


We’ve given one justification for the conditional probability formula. Now 
let’s give another, this time using frequencies. 

The frequency interpretation applies to conditional probabilities, just as 
it applies to all probabilities. So we can use the frequency interpretation 
(stated earlier as a Probability Fact) to derive a formula for $P(A \mid B)$.

The experimental situation that we are interested in now is the situa- 
tion in which the original experiment was performed and B occurred. The 
frequency with which A occurs in this experimental situation is the right 
approximation to $P(A \mid B)$. To get this frequency, repeat the original
experiment many times, say N times, but only record results for those times when
the physical event B occurs. The fraction of those recorded times for which
the physical event A occurs will give us a good approximation to P(A| B). 
We are assuming that N is large. Suppose that during the N repetitions 
of the experiment, the physical event B occurred M times. 
By the frequency interpretation for the unconditional probability, we 
know it is likely that 


\[
\frac{M}{N} \approx P(B). \tag{4.7}
\]


We are only interested here in the case that P(B) > 0, so that B can happen. 
When P(B) > 0, equation (4.7) tells us that M will be large when N is large.
Suppose that during the M times that B occurred, the physical event A 
occurred L times. 
By the frequency interpretation for the conditional probability, it is likely 
that 


\[
\frac{L}{M} \approx P(A \mid B). \tag{4.8}
\]
Thus
\[
P(A \mid B) \approx \frac{L/N}{M/N}. \tag{4.9}
\]


Let’s look at the fraction L/N. L counts the times when B occurred 
and A occurred. Thus, by the frequency interpretation for the unconditional 
probability, we know it is likely that 


\[
\frac{L}{N} \approx P(A \cap B). \tag{4.10}
\]
Apply equations (4.7) and (4.10) to equation (4.9). This allows us to
conclude:
\[
P(A \mid B) \approx \frac{P(A \cap B)}{P(B)}. \tag{4.11}
\]


Equation (4.11) is based on approximations that become more and more ac- 
curate as the number of trials increases. Thus the approximation in equation 
(4.11) tells us that $P(A \mid B)$ and $P(A \cap B)/P(B)$ must be equal, and so equation (4.1)
holds. 
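
The frequency argument can be tried out numerically. The following Python sketch is only an illustration: the experiment (one roll of a fair six-sided die), the events A = "the roll is even" and B = "the roll is at least 4", the function name, and the seed are all assumptions made for this example, not part of the text. The code repeats the experiment N times, counts M (the times B occurred) and L (the times both B and A occurred), and compares L/M with the value given by the conditional probability formula.

```python
import random
from fractions import Fraction

def conditional_frequency(n_trials, seed=0):
    """Repeat a die roll n_trials times; return (M, L, L/M).

    M counts the trials where B = {4, 5, 6} occurred, and
    L counts the trials where both B and A = {2, 4, 6} occurred.
    """
    rng = random.Random(seed)
    m_count = 0  # times B occurred
    l_count = 0  # times B occurred and A occurred
    for _ in range(n_trials):
        roll = rng.randint(1, 6)
        if roll >= 4:
            m_count += 1
            if roll % 2 == 0:
                l_count += 1
    return m_count, l_count, l_count / m_count

# Exact value from the conditional probability formula:
# P(A and B) / P(B) = (2/6) / (3/6) = 2/3.
exact = Fraction(2, 6) / Fraction(3, 6)

m, l, freq = conditional_frequency(100_000)
print(m, l, freq, float(exact))
```

With a large number of trials, the printed frequency L/M should be close to 2/3, although any single run only gives an approximation.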


4.3 Using the conditional probability formula 


Exercise 4.1. In the experiment described in Exercise 2.19, let A be the
event that the selected bean is yellow, red or green. Let B be the event that
the selected bean is red, purple or green. Find $P(A \mid B)$, using the conditional
probability formula. 


Exercise 4.2. In the setting of Exercise 2.21, when the experiment consists
of choosing two jelly beans in succession, let B be the event that the first 
bean chosen is red. 


(i) Find P(B). 


(ii) From the description of the experiment, R is the event that both beans
chosen are red. Thus $P(R \mid B)$ is the probability of choosing a red
bean in the second selection, from the remaining beans, after the first
selection resulted in a red bean. Use this observation to find $P(R \mid B)$
by applying Theorem 2.23 to an appropriate population, without using
the conditional probability formula.


(iii) Alternatively, find P(R) using the multiplied-through form of the con- 
ditional probability formula, equation (4.2). 


Check that your answer agrees with the probability found in
Exercise 2.21.


Exercise 4.3. Let A, B be events with $P(B) \neq 0$. Show that
\[
P(A \mid B) = P(A \cap B \mid B). \tag{4.12}
\]


Just for fun, the next exercise takes the “fraction of a fraction” idea to a
new level.


Exercise 4.4 (Telescoping conditional probabilities). Simplify
\[
P(A)\, P(B \mid A)\, P(C \mid A \cap B)\, P(D \mid A \cap B \cap C).
\]

It is assumed that $P(A)$, $P(A \cap B)$, and $P(A \cap B \cap C)$ are nonzero.


The mathematical properties of conditional probability mirror the way 
we think. The next exercises illustrate this. These exercises do two things: 
they give you practice in manipulating conditional formulas, and they test 
whether the assumptions of probability are well-chosen. 


Exercise 4.5 (Conditioning on an additional event). Suppose that
B, C are events for some probability model. Suppose that $P(B) \neq 0$. For
any event D, define $Q(D) = P(D \mid B)$. This is just to simplify notation.

When using Q as your distribution, the fact that B occurred is “built
into” your probability model.

Assume that $P(B \cap C) \neq 0$.

Show that $Q(C) \neq 0$, and that for any event A we have
\[
Q(A \mid C) = P(A \mid B \cap C). \tag{4.13}
\]

The event $B \cap C$ is the event that B occurred and C occurred. Thus
equation (4.13) is exactly what we expect from the idea that a conditional
probability uses additional information, since in calculating $Q(A \mid C)$ we are
adding still more information to the extra information that we already used
to calculate Q(A).


Exercise 4.6 (Conditioning on stronger information). Suppose that
B, C are events for some probability model. Suppose that $P(B) \neq 0$. For
any event D, define $Q(D) = P(D \mid B)$.

Let C be an event with P(C) > 0, such that C is a subset of B.

Show that $Q(C) \neq 0$, and that for any event A we have
\[
Q(A \mid C) = P(A \mid C). \tag{4.14}
\]


Exercise 4.7 (Ignore events that you knew had to happen). Let A, B
be events in some experiment.
Let C be an event such that $P(B \cap C) \neq 0$. Show that
\[
P(A \cap C \mid B \cap C) = P(A \mid B \cap C). \tag{4.15}
\]


4.4 Total probability


The name of the next theorem seems a little pretentious, since the statement 
is a simple consequence of additivity and the conditional probability formula. 
However, breaking a problem up into cases is a fundamental technique, and 
is frequently used. 


Theorem 4.3 (The Law of Total Probability). Let $D_1, \ldots, D_k$ be
disjoint events with union D, and let M be an event. Then
\[
P(M \cap D) = \sum_{i=1}^{k} P(D_i)\, P(M \mid D_i). \tag{4.16}
\]
In this equation, it appears that we must assume that $P(D_i) > 0$, so that
$P(M \mid D_i)$ will be defined. However, we can use the equation in all situations,
with the following convention: if $P(D_i) = 0$ then we simply interpret the
whole term $P(D_i)\, P(M \mid D_i)$ as zero.


When applying this theorem, we typically look for cases $D_i$ where we
know $P(M \mid D_i)$. So we break up problems into simpler parts.
Often the event M is such that $M \subseteq D_1 \cup \cdots \cup D_k$. Then $M \cap D = M$,
so equation (4.16) becomes
\[
P(M) = \sum_{i=1}^{k} P(D_i)\, P(M \mid D_i). \tag{4.17}
\]

Proof. We checked in an earlier exercise that intersection distributes over union,
so we know that
\[
M \cap D = \bigcup_{i=1}^{k} (M \cap D_i). \tag{4.18}
\]
(Actually, one likely doesn’t even think about “the distributive property”
when writing down equation (4.18). Instead one can just think about cases,
i.e. think about the possible ways that $M \cap D$ can happen: the possible ways
are $M \cap D_1, \ldots, M \cap D_k$.)
Since $D_1, \ldots, D_k$ are disjoint, additivity then gives
\[
P(M \cap D) = \sum_{i=1}^{k} P(M \cap D_i). \tag{4.19}
\]
Consider a term $P(M \cap D_i)$ on the right side of equation (4.19). If
$P(D_i) \neq 0$ then
\[
P(M \cap D_i) = P(D_i)\, P(M \mid D_i) \tag{4.20}
\]
by the conditional probability formula given in equation (4.2).
If $P(D_i) = 0$, then, since $M \cap D_i \subseteq D_i$, we also have $P(M \cap D_i) = 0$. So
equation (4.20) holds with $P(D_i)\, P(M \mid D_i)$ replaced by zero.
This gives equation (4.16).
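
To make the case-by-case sum concrete, here is a small Python sketch of the Law of Total Probability, including the convention that a term with $P(D_i) = 0$ counts as zero. The function name `total_probability`, the representation of cases as pairs, and the numbers in the usage lines are all hypothetical choices for illustration.

```python
from fractions import Fraction

def total_probability(cases):
    """Sum P(D_i) * P(M | D_i) over disjoint cases D_1, ..., D_k.

    `cases` is a list of pairs (p_d, p_m_given_d). If p_d is zero the
    conditional probability is undefined, so it may be given as None;
    the whole term is then counted as zero, following the convention
    stated with the theorem.
    """
    total = Fraction(0)
    for p_d, p_m_given_d in cases:
        if p_d == 0:
            continue  # convention: the term P(D_i) P(M | D_i) is zero
        total += Fraction(p_d) * Fraction(p_m_given_d)
    return total

# Hypothetical numbers: three disjoint cases, one with probability zero.
cases = [
    (Fraction(1, 2), Fraction(1, 4)),
    (Fraction(1, 2), Fraction(3, 4)),
    (Fraction(0), None),
]
result = total_probability(cases)
print(result)  # 1/2
```

Exact fractions are used here so that the result can be compared with hand calculations without rounding error.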


The following simple corollary is sometimes convenient. 


Corollary 4.4. For some experiment, let $D_1, \ldots, D_k$ and M be events. Let
D be the event that at least one of $D_1, \ldots, D_k$ occurs.

Suppose that at most one of the events $D_1, \ldots, D_k$ can occur, and that
$P(M \mid D_i) = p$ for $i = 1, \ldots, k$, where p is some number.

Then $P(M \mid D) = p$.


Exercise 4.8. Prove Corollary 4.4.


Does Corollary 4.4 seem physically obvious? (Think of a hall with many
doors, and suppose that for every door i, a hungry tiger waits behind that
door with probability p. Given that you must pass out through one of the
doors, is it hard to calculate your chance of survival?)


Example 4.5 (Sampling without replacement). Consider the setting
described in Exercise 2.19, where the bowl contains 75 yellow beans, 53 red
beans, 27 purple beans, and 18 green beans.

As in Exercise 4.2, the experiment has two steps. In the first step we stir
the bowl, and then select one jelly bean randomly, with no jelly bean in the 
bowl favored. We note the color of the chosen jelly bean, but do not replace 
the jelly bean in the bowl. 

In the second step, we stir the bowl again, and then select a second jelly 
bean, again with no jelly bean favored. 

Let $A_1$ be the event that the bean selected in step 1 is yellow or red. Let
$B_2$ be the event that the bean selected in step 2 is yellow or green. We would
like to find $P(A_1 \cap B_2)$.

Because the bowl is stirred, we are confident that the only way the first
step can affect the second step is by altering the numbers of jelly beans of
each color in the bowl.

Let $Y_1$ be the event that the first bean selected is yellow, and let $R_1$ be
the event that the first bean selected is red. Then $A_1 = Y_1 \cup R_1$.

Let $Y_2$ be the event that the second bean selected is yellow, and let $G_2$
be the event that the second bean selected is green. Then $B_2 = Y_2 \cup G_2$.

We could solve this problem by choosing a sample space consisting of all
possible pairs of beans that could be selected, and then adding up
probabilities of outcomes. However, we will use the law of total probability, which
says that
\[
P(A_1 \cap B_2) = P(Y_1)\, P(B_2 \mid Y_1) + P(R_1)\, P(B_2 \mid R_1).
\]

According to Exercise 2.19, the initial numbers of jelly beans in the bowl
are as follows: 75 yellow beans, 53 red beans, 27 purple beans, and 18 green
beans.

By Theorem 2.23, $P(Y_1) = 75/173$ and $P(R_1) = 53/173$.

We can also use Theorem 2.23 to find $P(B_2 \mid Y_1)$ and $P(B_2 \mid R_1)$. The
point here is that step 2 of the experiment is a “self-contained” sampling
experiment, that is, a sampling experiment that can be considered by itself.

Given $Y_1$, we know that the bowl contains 74 yellow beans and 172 beans
altogether. And there are 92 beans in the bowl that are yellow or green.
Thus by Theorem 2.23, $P(B_2 \mid Y_1) = 92/172$.

Similarly, given $R_1$, we know that the bowl contains 75 yellow beans and
172 beans altogether, and there are 93 beans in the bowl that are yellow or
green. Thus by Theorem 2.23, $P(B_2 \mid R_1) = 93/172$.

The rest is arithmetic.
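
For readers who would like to check the arithmetic, the following Python sketch carries it out with exact fractions, using the counts stated in the example (92 and 93 yellow-or-green beans out of 172 remaining). It then cross-checks the result by brute force, counting ordered pairs of distinct beans, which is the sample-space approach mentioned above. The variable names are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import permutations

# Bowl contents from the example.
counts = {"yellow": 75, "red": 53, "purple": 27, "green": 18}
total = sum(counts.values())  # 173

# Law of total probability over the two cases: first bean yellow, first bean red.
p_y1 = Fraction(counts["yellow"], total)
p_r1 = Fraction(counts["red"], total)
p_b2_given_y1 = Fraction(counts["yellow"] - 1 + counts["green"], total - 1)  # 92/172
p_b2_given_r1 = Fraction(counts["yellow"] + counts["green"], total - 1)      # 93/172
answer = p_y1 * p_b2_given_y1 + p_r1 * p_b2_given_r1

# Brute-force check: count ordered pairs of distinct beans.
beans = [color for color, n in counts.items() for _ in range(n)]
favorable = sum(
    1
    for first, second in permutations(beans, 2)
    if first in ("yellow", "red") and second in ("yellow", "green")
)
brute = Fraction(favorable, total * (total - 1))

print(answer, brute)
```

Both computations give 11829/29756, which is roughly 0.3975.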


Exercise 4.9. In the experiment of Example 4.5, suppose you learn that the
second jelly bean chosen was purple. What is the probability that the first 
jelly bean chosen was also purple? 


Exercise 4.10. Solve part (ii) of Exercise 3.3 again, applying the Law of
Total Probability (Theorem 4.3). The natural sample space for part (i) of
Exercise 3.3 is an interval of length 600. Use that as your sample space when
applying the Law of Total Probability in part (ii).


Exercise 4.11. There are two boxes on the table. Box 1 contains 10 red 
balls and 30 green balls. Box 2 contains 50 red balls and 10 green balls. Our 
experiment takes place in two steps. 


(1.) First, toss an unfair coin. The probability of a head is 2/3 for this 
unfair coin. 


(2.) If the result of the coin toss is a head, choose one ball at random from 
Box 1. Otherwise, choose one ball at random from Box 2. All the balls 
in Box 1 have the same chance to be selected. All the balls in Box 2 
have the same chance to be selected. 


Let A be the event that a green ball is selected.


(i) Find P(A), using the following sample space argument. 


Take $\Omega$ to be the set of all pairs (i, b), where i is the number of the
box that is chosen, and b identifies the ball that is chosen from Box i.
Let $C_1$ be the event that Box 1 is chosen, and let $C_2$ be the event that
Box 2 is chosen. You may assume from the physical description that
every outcome in $C_1$ has the same probability, and every outcome in
$C_2$ has the same probability. The physical description also tells us the
values of $P(C_1)$ and $P(C_2)$. Do not use the conditional probability
formula or the law of total probability.


(ii) Find P(A) again, using the law of total probability. 


We will return to the next exercise in a later example. There we will see
how to use random variable concepts to obtain more information.


Exercise 4.12 (Choosing from overlapping intervals). A fair coin is 
tossed. Suppose that if the result of the coin toss is a head, a point is chosen 
at random from [0, 3], with no point favored. If the result of the coin toss is a 
tail, a point is chosen at random from [2, 4], with no point favored. (Uniform
distributions on continuous intervals are discussed in Section 3.3.)

Let J be an interval of the real line, and let A be the event that the chosen
point is in J. Using the Law of Total Probability, find P(A). Consider four
cases: (i) $J \subseteq [0, 2)$, (ii) $J \subseteq [2, 3)$, (iii) $J \subseteq [3, 4]$, (iv) J disjoint from [0, 4].

You are not required to define a sample space for this experiment. In your 
solution you can simply work with the laws of probability, without specifying 
a sample space. 

Notice that if you want to have a sample space that represents every- 
thing that happens in the two steps of the experiment, it will be a bit more 
complicated than usual. 


4.5 The theorem of Bayes 


Theorem 4.6 (Bayes). Let A and B be events for some probability model, 
such that P(A) > 0 and P(B) > 0. Then 


\[
P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}. \tag{4.21}
\]

Theorem 4.6 is an immediate consequence of the conditional probability
formula, applied twice. First write $P(A \cap B)$ as $P(A)\, P(B \mid A)$ using
equation (4.2). Then find $P(A \mid B)$ using equation (4.1).

The formula in equation (4.21) was found by Thomas Bayes, and
published in a posthumous volume of his work in 1763. Although this formula
bears his name, the same formula was independently found by Laplace. De- 
spite its simplicity, the formula is frequently used, often repetitively as ex- 
perimental results are accumulated. 

If we think of A as describing a “cause”, and B as describing an “effect” 
due to this cause, we might think of the formula of Bayes as showing how to 
calculate the probability of a possible cause when a certain effect is observed. 

The quantity P(B) in the denominator of equation (4.21) can often be
calculated using the Law of Total Probability. 

The number P(A) is sometimes called a “prior” probability, meaning 
the probability of A before an experiment takes place, while P(A | B) is the 
“posterior” probability of A, meaning that it is the probability of A after the 
event B is observed in the experiment. 

Notice that we need to have some idea of the value of P(A) to use Bayes. 

Let’s write out a form of equation (4.21) using the Law of Total
Probability. Let an event M be a subset of the union of disjoint events $D_1, \ldots, D_k$.
Suppose that we observe M, and we wonder which of the events $D_i$ occurred.

By equation (4.17),
\[
P(D_i \mid M) = \frac{P(D_i \cap M)}{P(M)}
             = \frac{P(D_i)\, P(M \mid D_i)}{\sum_{j=1}^{k} P(D_j)\, P(M \mid D_j)}. \tag{4.22}
\]
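
The formula above translates directly into a short computation. The following Python sketch is an illustration under stated assumptions: the function name `posteriors` and the prior and conditional probabilities in the usage lines are hypothetical, and the code assumes that the cases are disjoint, that M is a subset of their union, and that P(M) > 0.

```python
from fractions import Fraction

def posteriors(priors, likelihoods):
    """Return the list of P(D_i | M), given P(D_i) and P(M | D_i).

    Assumes the D_i are disjoint, M is a subset of their union,
    and P(M) > 0.
    """
    joint = [p * q for p, q in zip(priors, likelihoods)]  # P(D_i) P(M | D_i)
    p_m = sum(joint)                                      # Law of Total Probability
    return [j / p_m for j in joint]

# Hypothetical numbers, for illustration only.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 10), Fraction(1, 2), Fraction(1, 4)]
post = posteriors(priors, likelihoods)
print(post)  # the three posterior probabilities: 1/5, 3/5, 1/5
```

Notice that in this made-up example the case with the largest prior probability does not have the largest posterior probability, because M is much more likely under the second case.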


In ordinary life we frequently use reasoning similar to the Theorem of 
Bayes. Consider the following. 


Example 4.7 (An everyday mystery). Grandma has just finished baking 
one of her delectable cherry pies. She places it in an open window to cool. 
Shortly thereafter, she observes that the pie is missing. There are several 
people who may have passed her window during the interval. Only one, 
however, has an extreme fondness for pie. Rolling pin in hand, Grandma
knows where to focus the next stage of her investigation.


Exercise 4.13 (Putting some numbers on Example 4.7). In the
situation of Example 4.7, the pie was only in the window for a short period
of time. Suppose that there are only three people who could have passed
by Grandma’s window during this time period: Alice, Brandon, and Clyde.
Let A be the event that Alice passed by, and let B and C be the corresponding
events for Brandon and Clyde. Grandma thinks it is very unlikely that two
people passed her window during this period, so she considers these events
to be disjoint.

Grandma originally had no reason to think any of the three people is more
likely than the others to pass by. She sets $P(A) = P(B) = P(C) = \theta$, where
$\theta$ is some positive number. Here P represents the probability that Grandma
would have assigned to an event, before she discovers that her pie is missing.

Let T be the event that the pie in the window is taken. Alice and Brandon
are highly reliable and have never shown any tendency to eat excessively
large quantities of pie. Clyde, on the other hand, has a bad track record.
Based on past events, Grandma sets $P(T \mid A) = .01$, $P(T \mid B) = .01$, and
$P(T \mid C) =$ .

Calculate $P(C \mid T)$.


Grandma can deal with her problem without using numbers, of course. 
In other situations the precision of mathematical calculation may be needed, 
as the next exercise illustrates. 


Exercise 4.14 (A positive result in a test for disease). A rare but 
serious disease is present in approximately .01% of the people in a large 
population, i.e. a fraction 1/10000 of the population have the disease. 

There is a test for this disease. A positive result for this test is an indi- 
cation of disease. 

The test is good but not perfect. When a healthy person is tested, the 
probability of a false positive is .01, i.e. one percent. 

For simplicity, assume that the test never misses an actual case of the 
disease. That is, assume that the probability of a false negative is zero. 

Suppose that someone is randomly selected from the population and 
tested. The result of the test is positive. Find the probability that the 
person has the disease.


Exercise 4.15. In the experiment of Exercise 4.11, suppose you learn that
a red ball was selected. Find the probability that the toss of the coin for this 
experiment produced a head. 


Exercise 4.16. Return to the experiment of Exercise 4.12.
Let H be the event that the coin toss gives a head, and let B be the event
that a point in (2, 3) is obtained. Find $P(H \mid B)$.


Exercise 4.17 (Bayes and the chosen coin). (i) To practice using the 
theorem of Bayes, let’s model a situation in which one of two coins is 
randomly chosen and then tossed. As usual when tossing a coin, we’ll 
think of getting a head as “success” for the toss. The coins are named 
coin 1 and coin 2. Coin 1 has success probability 2/5 and coin 2 has 
success probability 4/7. Suppose that each of the two coins has the 
same probability to be chosen. 


After the coin was chosen and tossed, you find that the result was a 
tail. You don’t know which coin was tossed. Find the probability that 
coin 2 is the coin that was tossed. 


Before calculating this probability, decide whether you think the prob- 
ability is greater than 1/2 or less than 1/2. 


(ii) Suppose now that in addition to coin 1 and coin 2 we also have coin 3. 
Like coin 2, this coin also has success probability 4/7. A new experi- 
ment is carried out, in which one of these three coins is selected with 
equal probability, and the selected coin is tossed. The result is a tail. 
Find the probability that the selected coin had success probability equal 
to 4/7.


The next exercise gives you a chance to practice with algebra and
inequalities. Even if you don’t do the problem, think about equation (4.23) and see
if it agrees with your own feelings about physical probabilities. 


Exercise 4.18. Consider two coins, coin a and coin b. Coin a has success
probability $p_a$, and coin b has success probability $p_b$, where $p_a > p_b$. That is,
coin a is luckier than coin b.

Suppose now that one of these two coins is randomly selected. Let A
be the event that coin a is selected, and let B be the event that coin b is
selected.

Assume that P(A) > 0 and P(B) > 0, so either coin could be chosen.
If we don’t choose coin a then we must choose coin b. So of course P(A) +
P(B) = 1.

After the selection, suppose that the selected coin is tossed. Let H be 
the event that this toss gives success. Show that P(H) > 0 and 


P(A| H) > P(A). (4.23) 


Thus obtaining a success with the coin makes you more confident that it is 
the lucky coin. 


4.6 Tree diagrams


Pictures are helpful in any field of mathematics. In probability problems it 
can be helpful to represent events using a “tree of possibilities”, aka a tree 
diagram. Tree diagrams do not introduce any new concepts, but they can 
assist us in seeing what is going on, when a computation involves several 
events. 

Drawing a tree diagram seems to be an art rather than a science, since 
the goal is to display ideas visually within a limited space. We can only make
a few general remarks here, and then give some examples. 

When using a tree diagram, we only need to draw the part of a tree that 
represents events which we are interested in. 

And there is no general rule about whether a tree diagram will be useful. 
When drawing your own diagram, just starting a tree may be enough to 
suggest how to actually approach the problem, and you can switch to using
equations. 

Every tree has a root. The branches spread out from the root, and may 
continue to spread. 

Every branch has two ends. The ends of the branches are often called 
“nodes”. The root is a node. A branch begins at one node and ends at 
another, and you have to remember which is which (the starting node is the 
one which is closer to the root). 

In a probability tree diagram, the nodes represent events. The root of 
the tree represents $\Omega$. The ending node of any branch represents an event
which is a subset of the event represented by the starting node. So a branch 
represents inclusion. And the end of one branch can be the starting node for 
another branch. So tree diagrams can potentially be large. 

If a branch starts with an event A and ends with an event B, then we 
often label the connecting branch with the conditional probability P(B | A). 
Each node along a chain of branches is always a subset of the earlier nodes 
in the chain. And the probability of a node which lies at the end of a chain 
of branches is equal to the product of the conditional probabilities along the 
chain! 
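
As a small illustration of this product rule, the following Python sketch multiplies the branch labels along one chain. The function name `path_probability` is an assumption for this example; the numbers are the ones for drawing two red jelly beans in succession from the bowl of Exercise 4.2 (53 red beans among 173, then 52 red among the remaining 172).

```python
from fractions import Fraction
from math import prod

def path_probability(branch_labels):
    """Multiply the conditional probabilities labelling a chain of branches."""
    return prod(branch_labels, start=Fraction(1))

# Chain: root -> "first bean red" -> "second bean red".
two_reds = path_probability([Fraction(53, 173), Fraction(52, 172)])
print(two_reds)  # 689/7439, the reduced form of (53 * 52) / (173 * 172)
```

An empty chain gives probability 1, which matches the fact that the root of the tree represents the whole sample space.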

When looking at tree diagrams in probability books, you will notice var- 
ious ways of labelling nodes in practical situations. Informality is the order 
of the day, and making ideas clear is the priority. The trees in tree diagrams 
may be drawn upside down, with the root at the top, or lying on one side! 

If a branch starts at a node called A and ends at a node representing
$A \cap M$, you may see the ending node labelled with “M”, rather than $A \cap M$.
In other words, we often label a node using only the additional properties
which distinguish it from the preceding node. However, to make sense of
the diagram you should think of the ending node as representing the event
$A \cap M$. (Exercise 4.3 shows why this convention does not change how we
calculate the conditional probability which labels the branch.)

A nice example of a tree diagram is given in [12], for the Monty Hall
problem (see Section 6.2 for information about Monty Hall).

Figure 4.2 shows a small tree diagram for Exercise 4.2. Notice that only
the relevant parts of the tree are shown. Since this is such a simple situation,
the tree serves no purpose, but the picture does show how trees work. Note
the informal labelling.

Figure 4.2: Obtaining two red jelly beans, one at a time (Exercise 4.2).

Let’s change Exercise 4.2. Instead of calculating the probability of two
red jelly beans, let’s find the probability of winding up with a red and a green.
Figure 4.3 shows a tree diagram for that calculation. To get the final answer
for this problem, note that you add the probabilities of two events: getting
a red and then a green, and getting a green and then a red. So you add the
probabilities associated with two paths on the tree, and the final answer is
\[
\frac{53 \cdot 18}{173 \cdot 172} + \frac{18 \cdot 53}{173 \cdot 172}
  = \frac{2 \cdot 53 \cdot 18}{173 \cdot 172}.
\]

Notice that in a tree diagram, nodes which are not on the same chain of 
branches necessarily represent disjoint events. Interpreting a tree diagram 
uses the Law of Total Probability, in an informal manner. 


Figure 4.3: Obtaining one red and one green jelly bean, one at a time, in
either order.


Example 4.8. Just for fun, let’s make a slightly bigger tree diagram. Think
of randomly selecting jelly beans, one at a time, from a bowl containing two
red jelly beans, one yellow jelly bean, and one green jelly bean. We want the
red ones, so we will stop as soon as we obtain both red beans!

Let $A_n$ be the event that it takes exactly n tries to get both red ones.
Here n might be 2, 3, or 4. Suppose we would like to find $P(A_n)$.

Figure 4.4 is a tree diagram for this problem, where getting a red bean is
represented by the end node of an upward branch, getting a yellow bean is
represented by the end node of a horizontal branch, and getting a green bean
is represented by the end node of a downward branch. To find $P(A_n)$, add
up the probabilities for the paths with length n. This gives $P(A_2) = 1/6$,
$P(A_3) = 1/3$ and $P(A_4) = 1/2$.

Figure 4.4: Sampling until two red jelly beans are obtained, starting with 2
red, 1 yellow, and 1 green. Upward indicates a red bean, horizontal indicates
a yellow bean, and downward indicates a green bean. There is one path of
length two, four paths of length three, and six paths of length four.
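
The values found from the tree can be cross-checked by brute force. The Python sketch below is one way to do this, under an assumption that is not part of the tree method itself: we imagine that all four beans are drawn in a uniformly random order, and record the draw on which the second red bean appears. Stopping early does not change when that draw happens, so the distribution of n is the same. The labels "R1" and "R2" for the two red beans are hypothetical names used to tell them apart.

```python
from fractions import Fraction
from itertools import permutations

beans = ["R1", "R2", "Y", "G"]  # two distinguishable reds, one yellow, one green

counts = {2: 0, 3: 0, 4: 0}
orders = list(permutations(beans))  # 24 equally likely orders
for order in orders:
    # Number of draws needed until both red beans have appeared.
    n = max(order.index("R1"), order.index("R2")) + 1
    counts[n] += 1

probs = {n: Fraction(c, len(orders)) for n, c in counts.items()}
print(probs)  # probabilities 1/6, 1/3, 1/2 for n = 2, 3, 4
```

The counts are 4, 8, and 12 orders out of 24, in agreement with the probabilities read off the tree.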


4.7 Solutions for Chapter 4


Solution (Exercise 4.1). Since $|B| = 53 + 27 + 18 = 98$ and $|\Omega| =
75 + 53 + 27 + 18 = 173$, $P(B) = 98/173$ by Theorem 2.23.
$A \cap B$ is the event that the selected bean is red or green. Hence $P(A \cap B) =
(53 + 18)/173 = 71/173$ by Theorem 2.23.
By the conditional probability formula, $P(A \mid B) = (71/173)/(98/173) =
71/98$.


Solution (Exercise 4.2). There are 75 yellow beans, 53 red beans, 27
purple beans, and 18 green beans in the bowl.
(i) This part of the question is a review of an earlier exercise.
There are 173 beans in the bowl, and 53 of them are red. By
Theorem 2.23,
\[
P(B) = \frac{53}{173}.
\]
(ii) After the selection of a red bean, the bowl contains 75 yellow beans,
52 red beans, 27 purple beans, and 18 green beans. This is the setting for
the second selection, given that B occurred. By Theorem 2.23,
\[
P(R \mid B) = \frac{52}{172}.
\]
(iii) By the multiplied-through version of the conditional probability
formula, equation (4.2),
\[
P(R) = P(B)\, P(R \mid B) = \frac{53}{173} \cdot \frac{52}{172}.
\]
This agrees with the probability found in Exercise 2.21.


Solution (Exercise 4.3). By the conditional probability formula (equation
(4.1)),
\[
P(A \cap B \mid B) = \frac{P(A \cap B \cap B)}{P(B)} = \frac{P(A \cap B)}{P(B)} = P(A \mid B).
\]
The first and last equalities hold by the conditional probability formula. The 
middle equality holds because BN B = B. 


Solution (Exercise 4.4). We can use either equation (4.2) or equation (4.1)
here.
Equation (4.2) tells us that
\[
P(A)\, P(B \mid A) = P(A \cap B),
\]
so that
\[
P(A \cap B \cap C) = P(A \cap B)\, P(C \mid A \cap B),
\]
and
\[
P(A \cap B \cap C \cap D) = P(A \cap B \cap C)\, P(D \mid A \cap B \cap C).
\]
Thus
\[
P(A)\, P(B \mid A)\, P(C \mid A \cap B)\, P(D \mid A \cap B \cap C) = P(A \cap B \cap C \cap D).
\]

Alternatively, using equation (4.1), we can substitute for each conditional
probability and then perform cancellations.


Solution (Exercise 4.5). By definition,
\[
Q(C) = \frac{P(C \cap B)}{P(B)}.
\]
Since we assume that $P(C \cap B) > 0$, $Q(C) > 0$ also.
By definition,
\[
Q(A \mid C) = \frac{Q(A \cap C)}{Q(C)}
            = \frac{P(A \cap C \cap B)/P(B)}{P(C \cap B)/P(B)}
            = \frac{P(A \cap C \cap B)}{P(C \cap B)}
            = P(A \mid B \cap C).
\]


Solution (Exercise 4.6). Since $C \cap B = C$, everything follows from
Exercise 4.5.


Solution (Exercise 4.7).
\[
P(A \cap C \mid B \cap C) = \frac{P(A \cap C \cap B \cap C)}{P(B \cap C)}
                          = \frac{P(A \cap B \cap C)}{P(B \cap C)}
                          = P(A \mid B \cap C).
\]
The first and last equalities hold by the conditional probability formula. The
middle equality holds because $C \cap C = C$.


Solution (Exercise 4.8). By total probability,
\[
P(M \cap D) = \sum_{i=1}^{k} P(D_i)\, P(M \mid D_i) = \sum_{i=1}^{k} P(D_i)\, p = p\, P(D).
\]


Dividing by P(D) gives the result. 


Solution (Exercise [4.9). Let P, be the event that the first jelly bean 
selected was purple, and let P: be the event that the second jelly bean selected 
was purple. We would like to find P(P, | P.). 

We can use a sample space consisting of all pairs of jelly beans (41, j2), 
where 7, # ja. Since there are 75 + 53 + 27 + 18 = 173 jelly bean altogether, 
the sample space contains 173-172 sample points. The physical description 
of the experiment tells us that all sample points are equally likely. 

Let p denote the probability of a sample point. Then 


4 
DP = 773.172" 
The number of purple jelly beans is 27, and the number of non-purple
jelly beans is $173 - 27 = 146$.

Since there are 27 purple jelly beans, $P_1 \cap P_2$ contains $27 \cdot 26$ sample
points.

We have

\[ P(P_1 \cap P_2) = (27 \cdot 26)p. \]

Using the same sample space, the event $P_2$ consists of all sample
points $(j_1, j_2)$ such that $j_2$ is purple. There are 27 choices for $j_2$, and for
each possible $j_2$ there are 172 choices for $j_1$. Thus $P_2$ contains $172 \cdot 27$ sample
points, and so $P(P_2) = (172 \cdot 27)p$.

Hence

\[ P(P_1 \mid P_2) = \frac{P(P_1 \cap P_2)}{P(P_2)} = \frac{(27 \cdot 26)p}{(172 \cdot 27)p} = \frac{27 \cdot 26}{172 \cdot 27} = \frac{13}{86}. \]

It should be noted that to find $P(P_2)$ there is no need to consider the
first step of the experiment. $P_2$ occurs if the second jelly bean is purple, and
there are 27 choices for that jelly bean, out of a possible 173. Thus

\[ P(P_2) = \frac{27}{173}. \]

One should check that this agrees with the value found earlier: $(172 \cdot 27)p$.


Thinking backwards As an alternative method, we might ignore the
physical times at which the steps occur, and just think about the possible
pairs of jelly beans $(j_1, j_2)$ that are obtained. One could think about building
a pair by (mentally) selecting $j_2$ first, and then selecting $j_1$. There are 173
choices for $j_2$, and then, having chosen $j_2$, there are 172 choices for $j_1$.

Thus to find $P(P_1 \mid P_2)$, think that a purple jelly bean has already been
chosen for $j_2$. $P(P_1 \mid P_2)$ is the probability that the choice of $j_1$ now gives a
purple jelly bean. Thus

\[ P(P_1 \mid P_2) = \frac{26}{172} = \frac{13}{86}, \]

as before.
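
The counting argument above can be checked by brute force. The following is a minimal sketch, assuming only what the solution states: 173 beans, 27 of them purple, and all ordered pairs of distinct beans equally likely. The variable names are illustrative and are not part of the text.

```python
from fractions import Fraction
from itertools import permutations

# 27 purple beans out of 173; the other colours are lumped together,
# since only "purple or not" matters for this question.
beans = ["purple"] * 27 + ["other"] * (173 - 27)

# Sample space: ordered pairs (j1, j2) of distinct beans, all equally likely.
pairs = list(permutations(range(len(beans)), 2))

# Count the sample points in P1 ∩ P2 and in P2.
both_purple = sum(
    1 for j1, j2 in pairs if beans[j1] == "purple" and beans[j2] == "purple"
)
second_purple = sum(1 for j1, j2 in pairs if beans[j2] == "purple")

p_second = Fraction(second_purple, len(pairs))            # P(P2)
p_first_given_second = Fraction(both_purple, second_purple)  # P(P1 | P2)

print(p_second, p_first_given_second)
```

Exact fractions are used so that the result can be compared directly with the values obtained by hand.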


Solution (Exercise 4.10). As in the solution for part (i) of Exercise 3.3,
$P(A) = 300/600 = 1/2$.
Suppose that the situation described in part (ii) of Exercise 3.3 holds.
That is, Alice has searched two-thirds of her section, Bob has searched half 
of his section, Clancy has searched three-quarters of his section. We use 
probabilities based on this information. 

Let N be the event that the coin has not yet been found. 

Let $A$ be the event that the coin is located in Alice’s interval, let $B$ be
the event that the coin is located in Bob’s interval, and let $C$ be the event
that the coin is located in Clancy’s interval. The events $A, B, C$ are disjoint,
and $A \cup B \cup C = \Omega$. Part (ii) of Exercise 3.3 asks us to find $P(A \mid N)$.


By equation (4.17),
\[ P(N) = P(A)P(N \mid A) + P(B)P(N \mid B) + P(C)P(N \mid C). \]


If event A holds, then the activities of Bob and Clancy have no influence on 
the discovery of the coin. 

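Let $I$ be Alice’s interval and let $J$ be the part of Alice’s interval which has
not yet been searched. Then $P(N \mid A)$ is simply the probability that the coin
is located in $J$. Thus

\[ P(N \mid A) = \frac{\text{length}(J)}{\text{length}(I)} = \frac{1}{3}. \]

The values of $P(N \mid B)$ and $P(N \mid C)$ are found similarly.
This gives

\[ P(N) = \left(\frac{1}{2}\right)\left(\frac{1}{3}\right) + \left(\frac{1}{3}\right)\left(\frac{1}{2}\right) + \left(\frac{1}{6}\right)\left(\frac{1}{4}\right) = \frac{3}{8}. \]

We need to find $P(A \mid N)$. By the Conditional Probability Formula,

\[ P(A \mid N) = \frac{P(A)P(N \mid A)}{P(N)} = \frac{\left(\frac{1}{2}\right)\left(\frac{1}{3}\right)}{\frac{3}{8}} = \frac{4}{9}. \]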


Solution (Exercise 4.11). (i) The probability of a head is given to be
2/3.
Hence $P(C_1) = 2/3$. There are 40 outcomes in $C_1$, each of equal probability.
Hence every outcome in $C_1$ has probability $(2/3)(1/40)$.
$P(C_2) = 1/3$. There are 60 outcomes in $C_2$, each of equal probability.
Hence every outcome in $C_2$ has probability $(1/3)(1/60)$.




There are 30 outcomes in $A \cap C_1$, each with probability 1/60. Hence
$P(A \cap C_1) = 1/2$.

There are 10 outcomes in $A \cap C_2$, each of probability 1/180. Hence
$P(A \cap C_2) = 1/18$.

Thus $P(A) = 1/2 + 1/18 = 5/9$.


(ii)

\[ P(A) = P(C_1)P(A \mid C_1) + P(C_2)P(A \mid C_2) = \frac{2}{3}\cdot\frac{30}{40} + \frac{1}{3}\cdot\frac{10}{60} = \frac{1}{2} + \frac{1}{18}. \]

Solution (Exercise 4.12). Let $H$ be the event that the coin toss gives
a head. Let $T$ be the event that a tail is obtained. Since the coin is fair,
$P(H) = P(T) = 1/2$.

Let $J$ be a subinterval of $[0,4]$. Let $A$ be the event that the chosen point is
in $J$.

By the Law of Total Probability,

\[ P(A) = P(H)P(A \mid H) + P(T)P(A \mid T). \tag{4.24} \]

Case (i): $J \subset [0,2)$ When $H$ occurs, the point is chosen from $[0,3)$ with
uniform probability on $[0,3)$. By equation (3.4),

\[ P(A \mid H) = \frac{\text{length}(J)}{\text{length}([0,3))} = \frac{1}{3}\,\text{length}(J). \tag{4.25} \]

When $T$ occurs, the point is chosen from $[2,4]$, so $P(A \mid T) = 0$.
Substituting in equation (4.24),

\[ P(A) = \frac{1}{6}\,\text{length}(J). \tag{4.26} \]


Case (ii): $J \subset [2,3)$ When $H$ occurs, the point is chosen from $[0,3)$ with
uniform probability on $[0,3)$. By equation (3.4),

\[ P(A \mid H) = \frac{\text{length}(J)}{\text{length}([0,3))} = \frac{1}{3}\,\text{length}(J). \tag{4.27} \]

When $T$ occurs, the point is chosen from $[2,4]$ with uniform probability.
By equation (3.4),

\[ P(A \mid T) = \frac{\text{length}(J)}{\text{length}([2,4])} = \frac{1}{2}\,\text{length}(J). \tag{4.28} \]

Substituting in equation (4.24),

\[ P(A) = \frac{1}{6}\,\text{length}(J) + \frac{1}{4}\,\text{length}(J) = \frac{5}{12}\,\text{length}(J). \tag{4.29} \]

Case (iii): $J \subset [3,4]$ When $H$ occurs, the point is chosen from $[0,3)$, so
$P(A \mid H) = 0$.

When $T$ occurs, the point is chosen from $[2,4]$ with uniform probability.
By equation (3.4),

\[ P(A \mid T) = \frac{\text{length}(J)}{\text{length}([2,4])} = \frac{1}{2}\,\text{length}(J). \tag{4.30} \]

Substituting in equation (4.24),

\[ P(A) = \frac{1}{4}\,\text{length}(J). \tag{4.31} \]


Case (iv): J disjoint from [0,4] In all cases, the point is chosen from 
[0,4], so P(A) = 0. 
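
The case analysis can be collapsed into a single computation by applying equation (4.24) directly to any interval. The sketch below assumes the setup described in this solution: a fair coin, a uniform point on [0,3) after a head, and a uniform point on [2,4] after a tail. The function names `overlap` and `prob_point_in` are illustrative only.

```python
from fractions import Fraction

def overlap(a, b, lo, hi):
    """Length of the intersection of [a, b] with [lo, hi] (0 if they do not meet)."""
    return max(Fraction(0), min(b, hi) - max(a, lo))

def prob_point_in(a, b):
    """P(A) for J = [a, b], using P(A) = P(H)P(A|H) + P(T)P(A|T)."""
    a, b = Fraction(a), Fraction(b)
    p_given_h = overlap(a, b, 0, 3) / 3   # uniform on [0, 3), length 3
    p_given_t = overlap(a, b, 2, 4) / 2   # uniform on [2, 4], length 2
    return Fraction(1, 2) * p_given_h + Fraction(1, 2) * p_given_t

# One interval from each case, plus the whole of [0, 4].
for a, b in [(0, 2), (2, 3), (3, 4), (0, 4), (5, 6)]:
    print((a, b), prob_point_in(a, b))
```

Endpoints have zero length, so it makes no difference here whether intervals are treated as open or closed. The value for the whole interval [0,4] should be 1, which is a useful sanity check on the three cases.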


Solution (Exercise 4.13). Grandma considers that the events $A, B, C$ are
mutually exclusive. Let $D$ be their union.

The physical statement of the problem tells us that Alice, Brandon and
Clyde are the only people that could have taken the pie. Hence $T \subset D$, so
$T \cap D = T$. By the Law of Total Probability (Theorem 4.3),

\[ P(T) = P(A)P(T \mid A) + P(B)P(T \mid B) + P(C)P(T \mid C) = \delta(.01) + \delta(.01) + \delta(.5). \tag{4.32} \]

By definition,

\[ P(C \mid T) = \frac{P(T \cap C)}{P(T)} = \frac{P(C)P(T \mid C)}{P(T)} = \frac{\delta(.5)}{\delta(.52)} = \frac{50}{52}. \]

This number is close to one, and directs Grandma’s attention to Clyde.


Solution (Exercise 4.14). Let $A$ be the event that the person who is
tested actually has the disease. Let $B$ be the event that the test is positive.
Then

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B \mid A)}{P(A)P(B \mid A) + P(A^c)P(B \mid A^c)}. \tag{4.33} \]

We are given that $P(A) = .0001$. Hence $P(A^c) = .9999$.
We are given that $P(B \mid A^c) = .01$ and $P(B \mid A) = 1$.
Thus

\[ P(A \mid B) = \frac{.0001(1)}{.0001(1) + .9999(.01)} = \frac{1}{1 + 99.99} = \frac{1}{100.99}. \]


Remark 4.9 (The worst case for a positive test recipient). By equation (4.33),

\[ P(A \mid B) = \frac{1}{1 + \left(\frac{P(A^c)}{P(A)}\right)\left(\frac{P(B \mid A^c)}{P(B \mid A)}\right)}. \tag{4.34} \]

Notice that for any given values of $P(A)$, $P(A^c)$ and $P(B \mid A^c)$, the quantity
$\frac{P(B \mid A^c)}{P(B \mid A)}$ decreases when $P(B \mid A)$ increases.

Thus the denominator of the fraction in equation (4.34) decreases when
$P(B \mid A)$ increases.

We conclude that $P(A \mid B)$ increases when $P(B \mid A)$ increases.

So taking $P(B \mid A) = 1$ gives the largest possible value for $P(A \mid B)$ when
the other numbers are known. Thus in Exercise 4.14 we have calculated
$P(A \mid B)$ in the worst case.
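
The calculation in Exercise 4.14 and the monotonicity claim in the remark can be checked numerically. This is a small sketch of Bayes' formula for a two-way split into $A$ and its complement. The parameter names `prior`, `sensitivity` and `false_positive_rate` are labels chosen here for $P(A)$, $P(B \mid A)$ and $P(B \mid A^c)$; they do not appear in the text.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(A | B) from P(A), P(B | A) and P(B | A^c), via Bayes' formula."""
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)

# Values given in Exercise 4.14: P(A) = .0001, P(B | A) = 1, P(B | A^c) = .01.
worst_case = posterior(0.0001, 1.0, 0.01)
print(worst_case)

# Holding the other numbers fixed, P(A | B) should grow as P(B | A) grows.
sensitivities = [0.5, 0.7, 0.9, 1.0]
values = [posterior(0.0001, s, 0.01) for s in sensitivities]
print(values)
```

Floating point arithmetic is used for brevity, so results agree with the hand calculation only up to rounding.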


Solution (Exercise 4.15). Let $R$ be the event that a red ball was selected,
and let $H$ be the event that the coin which was tossed produced a head.
We wish to find $P(H \mid R)$.

\[ P(H \mid R) = \frac{P(H \cap R)}{P(R)} = \frac{P(H)P(R \mid H)}{P(H)P(R \mid H) + P(H^c)P(R \mid H^c)}. \]


Solution (Exercise 4.16). Using Bayes,

\[ P(H \mid B) = \frac{P(H \cap B)}{P(B)}. \]

Thus

\[ P(H \mid B) = \frac{P(H)P(B \mid H)}{P(B)}. \]


By the solution to Exercise 


\[ P(H \mid B) = \frac{\frac{1}{2}\cdot\frac{1}{3}\,\text{length}((2,3))}{\frac{5}{12}\,\text{length}((2,3))} = \frac{2}{5}. \]


Solution (Exercise 4.17).


(i) The coins had equal chances of being chosen, and the result (a tail) is 
more likely if coin 1 was tossed. This makes it more likely that coin 1 was 
used, so we should expect that the probability that coin 2 was used is less 
than 1/2. 

Let A be the event that coin 2 was used, and let B be the event that the 
result was a tail. We wish to calculate P(A| B). 

Using Bayes, 


\[ P(A \mid B) = \frac{P(A)P(B \mid A)}{P(B)} = \frac{P(A)P(B \mid A)}{P(A)P(B \mid A) + P(A^c)P(B \mid A^c)}. \]


(ii) Now we let $A$ be the event that the selected coin had probability equal
to 4/7, i.e. the event that coin 2 or coin 3 was used. Similarly to part (i),
we then have

\[ P(A \mid B) = \frac{P(A)P(B \mid A)}{P(B)} = \frac{P(A)P(B \mid A)}{P(A)P(B \mid A) + P(A^c)P(B \mid A^c)}. \]

Solution (Exercise 4.18).
\[ P(A \mid H) = \frac{P(A \cap H)}{P(H)} = \frac{P(A)P(H \mid A)}{P(A)P(H \mid A) + P(B)P(H \mid B)}. \]
That is,
\[ P(A \mid H) = \frac{P(A)\,p_A}{P(A)\,p_A + P(B)\,p_B}. \tag{4.35} \]

Take a look at the final expression in equation (4.35). If we replace $p_B$ by $p_A$,
then the denominator gets bigger. So the fraction gets smaller. This tells us
that
\[ P(A \mid H) > \frac{P(A)\,p_A}{P(A)\,p_A + P(B)\,p_A} = \frac{P(A)}{P(A) + P(B)}. \]
Chapter 5


Independence and its 
consequences 


5.1 Independence defined 


Consider an experiment. Let A be an event which describes one property of 
the result, and let B be an event which describes another property of the 
result. 


Definition 5.1 (Physical independence). We will say that physical events 
A and B are independent if knowledge that A occurred does nothing to 
change your opinion about whether B occurred, and vice versa. 


Assume $P(B) \neq 0$. We have said that $P(A \mid B)$ is the probability of $A$
given the information that $B$ occurred. If $A$ and $B$ are physically independent,
then the occurrence or non-occurrence of $B$ is irrelevant to predictions
about $A$. That is, we must have

\[ P(A \mid B) = P(A). \tag{5.1} \]
Using the mathematical formula for conditional probability, this says that
\[ \frac{P(A \cap B)}{P(B)} = P(A), \]
and so
\[ P(A \cap B) = P(A)P(B). \tag{5.2} \]
Notice that equation (5.2) also holds when $P(B) = 0$, and this equation is
symmetric in $A$ and $B$.


Definition 5.2 (Mathematical Independence). Let $A, B$ be events in a
mathematical model. Whenever equation (5.2) holds, we say that the events $A$ and $B$
are independent. Equivalently, we say that the pair $A, B$ is independent.

Sometimes it is convenient to express independence more colloquially, by
saying that $A$ is independent of $B$. Of course this also implies that $B$ is
independent of $A$.


Equation (5.2) is the whole definition of mathematical independence.
Sometimes one refers to mathematical independence as “statistical independence”.
This reflects the fact that mathematical independence will hold in a
model whenever the experimental statistics fit equation (5.2). We can assert
that events are independent without identifying an underlying physical cause
to explain why they are independent.

Incidentally, when we say “A and B are independent events”, it may
sound as if there is a property called “independence” that each event can
have separately. That is not the case, and one should keep in mind that
independence expresses a relationship, and is a property of two events considered
together.

When equation (5.2) holds for two events, we also say that the probability of the
events is multiplicative, meaning that the probability of the intersection is
equal to the product of the separate probabilities.


Remark 5.3. Let $A, B$ be events with $P(B) \neq 0$. Our discussion shows that
the following statements are equivalent.


(i) A, B are independent. 
(ii) P(A| B) = P(A). 


We can use whichever formulation is convenient. 


Example 5.4 (Tossing a coin twice). Consider the experiment of tossing
a coin twice. Let $H_1$ be the event that the result of the first toss is a head,
and let $H_2$ be the event that the result of the second toss is a head.
Let $p = P(H_1)$. Based on our experience with coins, we expect that also
$P(H_2) = p$. What about the probability of obtaining two heads in succession,
i.e. $P(H_1 \cap H_2)$?

In ordinary experience, neither the coin nor the tosser is significantly
altered by the result of the first coin toss. So we expect that when $P(H_1) > 0$,
$P(H_2 \mid H_1) = P(H_2)$. And indeed experience shows us that the probability
of a head on the second toss is unaffected by the result of the first toss. Thus

\[ P(H_1 \cap H_2) = P(H_1)P(H_2 \mid H_1) = P(H_1)P(H_2). \]

By Definition 5.2, $H_1, H_2$ are independent.
Equation (5.2) is easy to verify directly when $P(H_1) = 0$! Thus in all
cases, $H_1, H_2$ are independent. We conclude that

\[ P(H_1 \cap H_2) = P(H_1)P(H_2) = p^2. \]
The same argument works for any combination of heads or tails on the
two tosses. Thus, with $q = 1 - p$ we also have
\[ P(H_1 \cap H_2^c) = P(H_1)P(H_2^c) = pq, \]
\[ P(H_1^c \cap H_2) = P(H_1^c)P(H_2) = qp, \]
and
\[ P(H_1^c \cap H_2^c) = P(H_1^c)P(H_2^c) = q^2. \]


Exercise 5.1 (Sample space for two tosses). We can choose a particular
sample space $\Omega$ to model tossing a coin twice. For example, let 1 denote a
head and let 0 denote a tail, and take $\Omega = \{(1,1), (1,0), (0,1), (0,0)\}$.


(i) Define $H_1 = \{(1,1), (1,0)\}$ and define $H_2 = \{(1,1), (0,1)\}$.
Find $H_1 \cap H_2$.

(ii) Let $q = 1 - p$.
Show that the correct definition for $P$ on this sample space is the
following.
\[ \begin{aligned} P(\{(1,1)\}) &= p^2, \\ P(\{(1,0)\}) &= pq, \\ P(\{(0,1)\}) &= qp, \\ P(\{(0,0)\}) &= q^2. \end{aligned} \tag{5.3} \]

(iii) Verify that the values in part (ii) give $P(\{(1,1)\}) + P(\{(1,0)\}) +
P(\{(0,1)\}) + P(\{(0,0)\}) = 1$.


In previous comments we have mentioned that an abstract sample point
need only represent the properties of an outcome that we currently wish to
analyze. Example 2.13 provided a model that represents tossing a coin once.
Now in Exercise 5.1 we have considered a model for two tosses. The model
for two tosses also gives a representation for a single toss, since each of the
two tosses is a single toss by itself. Are these representations consistent? The
next exercise addresses that question.


Exercise 5.2. This exercise extends Exercise 2.4 from the case of a fair coin
to the case of a general coin.

In Exercise let A be the event that the first of two tosses results in 
a head. Check that P(A) = p, using the sample space for two tosses. That 
means you can use the probabilities for the outcomes (1, 1), (1,0), (0,1), (0, 0), 
but nothing else. 

Do the same for the event $B$ that the second of two tosses results in a
head.


Remark 5.5 (Checking our models, or not). In situations like Exercise 5.2,
our physical experience guides us very clearly. Since our models for
one toss and two tosses represent experience so well, we are confident that
they must be mathematically consistent with each other.

For that reason, we won’t usually bother proving this sort of consistency, 
even though we did so in Exercise 


Exercise 5.3. Return to the situation of Exercise 2.7. This deals with the
experiment of rolling a fair die twice. You are asked to find the probability
that the first roll produces an even number and the second roll produces a
number larger than four.
Use independence to obtain the answer. You may use the fact that any
event which only involves the first roll is independent of any event which only
involves the second roll.

(And of course, if your answer now does not agree with the value found
in Exercise 2.7, something is wrong.)


Exercise 5.4. When rolling a fair die twice, let A be the event that the sum 
of the numbers obtained on the two rolls is an even number. 

Find P(A). You have solved this problem in Exercise This time use 
independence to save work. 


Exercise 5.5 (Simple cases of independence). Please check the following 
two easy special cases of independence. 
For any events A, B, 


\[ P(A) = 0 \text{ or } P(A) = 1 \implies A \text{ and } B \text{ are independent.} \tag{5.4} \]


Exercise 5.6 (Independent and disjoint?). Suppose that events $A, B$ are
disjoint. Under what conditions will $A, B$ also be independent?


We will see many more examples of independent events in the rest of 
this book. Independence simplifies probability calculations immensely. This 
makes it tempting to assume independence when analyzing problems, which 
can lead to errors. An example of the misuse of probability in a court case, 
including an unjustified assumption of independence, is given in [2]. 
5.2 Independence applies to complements


Before returning to examples, let’s take a little while to explore the definition 
of mathematical independence. 


Lemma 5.6 (Independence and complements). Let $A$ and $B$ be any
events in a probability model. Each of the following statements is mathematically
equivalent to any of the others.

(i) $A, B$ are independent.
(ii) $A, B^c$ are independent.
(iii) $A^c, B$ are independent.
(iv) $A^c, B^c$ are independent.


The phrase “mathematically equivalent” for two statements means that
if either one of the statements is true then the other is true also. The
equivalence stated in Lemma 5.6 seems totally reasonable if we think about
information. If you know whether or not $A$ occurred, then you know whether or
not $A^c$ occurred, and so on!

Using the definition of independence, Lemma 5.6 can be restated as follows.


Lemma 5.7 (Equations for independence and complements). Let $A$
and $B$ be any events in a probability model. Each of the following four
equations is mathematically equivalent to any of the others.

\[ P(A \cap B) = P(A)P(B), \tag{5.5} \]
\[ P(A \cap B^c) = P(A)P(B^c), \tag{5.6} \]
\[ P(A^c \cap B) = P(A^c)P(B), \tag{5.7} \]
\[ P(A^c \cap B^c) = P(A^c)P(B^c). \tag{5.8} \]

The sets $A \cap B$, $A \cap B^c$, $A^c \cap B$ and $A^c \cap B^c$ are represented in Figure 5.1.




A proof for Lemma 5.7 is requested in Exercise 5.7. To give a mathematical
proof we will have to think about the precise definition of mathematical
independence, not just the physical meaning. As usual, you are encouraged
to work at the proofs, but the physical meaning is the most important thing.


Exercise 5.7. Prove Lemma 5.7.
A good first step is to prove the following.


Substitution Fact For any events $D_1, D_2$, suppose that

\[ P(D_1 \cap D_2) = P(D_1)P(D_2) \tag{5.9} \]
holds. Then

\[ P(D_1^c \cap D_2) = P(D_1^c)P(D_2). \tag{5.10} \]

In other words, replacing $D_1$ by $D_1^c$ throughout the first equation gives
another true statement.

Of course, since order doesn’t affect the intersection operation, and order
doesn’t affect multiplication either, the Substitution Fact also implies that a
true statement is also obtained from equation (5.9) if $D_2$ is replaced by $D_2^c$.

Once the Substitution Fact is proved, you can apply it to proving the
lemma.


[Figure 5.1: The pieces of $\Omega$ generated by $A$ and $B$. Panels: A divides the space; B divides the space; The pieces.]


The next exercise is also physically clear, since the criterion for indepen- 
dence is essentially a mathematical rephrasing of Definition 
Exercise 5.8 (A test for independence). Let $A, B$ be events with $P(A) > 0$
and $P(A^c) > 0$.


(i) Suppose that
\[ P(B \mid A) = P(B \mid A^c). \tag{5.11} \]


Show that A, B are independent. 


(ii) Suppose that equation (5.11) does not hold. Show that A,B are not 
independent. 


5.3 Independence for sampling with replacement


Ok, back to examples. As in Example 4.5 and Example 2.24, we consider a
two-step experiment. We select two jelly beans, one at a time. Each selection
is random, and is such that no jelly bean in the bowl is favored.

However, unlike Example 4.5 and Example 2.24, after we have noted the
color of the first jelly bean that is selected, we replace it in the bowl before
proceeding to make the second selection.

Let $A_1$ be the event that the bean selected in step 1 is yellow or red. Let
$B_2$ be the event that the bean selected in step 2 is yellow or green. We would
like to find $P(A_1 \cap B_2)$.

Assume that in the bowl, before any selections, there are y yellow beans, 
r red beans, and g green beans. 

Consider each step as an experiment in itself. As usual, by Theorem 2.23
we have
\[ P(A_1) = \frac{y + r}{y + r + g}, \qquad P(B_2) = \frac{y + g}{y + r + g}. \tag{5.12} \]

Because we stir the bowl before each selection, our physical experience
tells us that the results of step 1 and step 2 are physically independent, and
so we confidently assume that $A_1$ and $B_2$ are mathematically independent.
Hence we can find $P(A_1 \cap B_2)$ at once:

\[ P(A_1 \cap B_2) = P(A_1)P(B_2). \tag{5.13} \]
Notice that we did not need to actually specify a sample space for the
two-step experiment. Instead we simply followed the rules of probability, 
using independence. 
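
For comparison, here is what the calculation looks like if we do write down a sample space. The sketch below assumes a model in which every ordered pair (first bean, second bean), repeats allowed, is equally likely; that assumption is what builds independence into the model. The counts `y`, `r`, `g` are hypothetical values picked only for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical bowl contents; the text leaves y, r, g unspecified.
y, r, g = 5, 3, 2
bowl = ["yellow"] * y + ["red"] * r + ["green"] * g

# Sampling with replacement: all ordered pairs of beans, repeats allowed.
pairs = list(product(bowl, repeat=2))

# A1: first bean is yellow or red.  B2: second bean is yellow or green.
hits = sum(
    1
    for first, second in pairs
    if first in ("yellow", "red") and second in ("yellow", "green")
)
p_joint = Fraction(hits, len(pairs))

# The single-step probabilities, each step treated as its own experiment.
p_a1 = Fraction(y + r, y + r + g)
p_b2 = Fraction(y + g, y + r + g)

print(p_joint, p_a1 * p_b2)
```

The two printed values agree, so in this equally-likely-pairs model the product rule gives the same answer as direct counting, with much less work.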


Remark 5.8 (A bit more about the “why” of independence). In an 
experimental situation, we would expect statistical independence for events 
A and B if there is no physical connection between the physical processes 
involved in A and the physical processes involved in B. But in a literal sense, 
that is almost never the case. 

Consider a person tossing a coin twice. The physical action of tossing 
the coin the first time certainly causes changes in the tosser, even though 
they are small. For example, the tosser will likely change position slightly 
when tossing the coin, and picking it up. And the tosser’s arm will be very 
slightly tired by the tossing action. Does that matter for the result of the 
second toss? Well, we know that even small changes can change the result of 
a toss, so we can’t immediately dismiss the idea that it matters. Of course 
our real-world experience shows that it doesn’t matter, at least as far as the 
statistics of the two tosses is concerned. 

The current jelly bean experiment might be a little easier to analyze. 
Let’s think about that. 

Example says that when choosing jelly beans we prepare for the 
experiment by stirring the jelly beans vigorously, so that the beans in the 
bowl are thoroughly mixed. 

Since we have a two-step experiment, we will stir that bowl of jelly beans 
twice. That first stirring of the beans is certainly going to have an effect on 
the starting situation of the second step. In fact, it will have a big effect. It 
may move every single jelly bean in the bowl! And yet we are confident that 
the two steps have statistically independent results! What is going on here? 

To fix our ideas, let’s imagine that in the experiment we always select the 
top bean in the very center of the bowl. Call that the “pickup location”. 

For simplicity, let’s assume that the bowl only contains yellow and red 
jelly beans, and that there are exactly the same number of yellow and red 
jelly beans. 

Given these assumptions, we are confident that the chance of a yellow 
bean being selected is 1/2. And we don’t believe this depends on the state 
of the bowl before the beans are stirred. Somehow stirring is just as likely to 
move a red bean into pickup location as a yellow bean. 

It would be nice to give a theoretical explanation of how stirring creates 
independence, and more generally to give a theoretical characterization of
which physical connections will alter statistical results, and which will not. 
But that seems to be an unsolved hard problem. 

Our probability model for choosing jelly beans doesn’t concern itself with 
the details of stirring, so we do not try to explain why the choices are in- 
dependent. Our judgement about the probability of selecting a bean is just 
built into the model, based on our general practical experience. 


5.4 Using independence to simplify calculations


We often use independence to justify ignoring events. For example, suppose
that in some large and complicated experiment we think about three events,
$A$, $B$, and $C$. Physically, suppose we believe that $A$ and $B$ depend on certain
properties of the experimental setup which are unrelated to the occurrence
or non-occurrence of $C$.

As a practical matter, when calculating something about $A$ and $B$, for
example $P(B \mid A)$, we can completely ignore $C$, even if we know whether or not
$C$ occurred. We do this automatically in our problems. Calculations of
probabilities would be hopelessly complex if we could not make simplifications of
this sort!

That’s the physical picture. It would be interesting to consider
mathematical ways to express the fact that $C$ can be ignored, but we won’t take
time for that, except in the next exercise.


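Exercise 5.9. Assume that
\[ P(A \cap C) = P(A)P(C), \tag{5.14} \]
\[ P(B \cap C) = P(B)P(C), \tag{5.15} \]
\[ P(A \cap B \cap C) = P(A \cap B)P(C). \tag{5.16} \]
Show that then
\[ P(B \mid A \cap C) = P(B \mid A). \tag{5.17} \]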

Equation (5.17) is an example of ignoring $C$, when the three independence
statements (5.14), (5.15) and (5.16) all hold. One might guess the first
two independence statements should be sufficient, but they ain’t. Something
like condition (5.16) is needed too.
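
To see concretely why two pairwise independence statements are not enough, here is one possible counterexample, sketched in code. It is not taken from the text: it assumes two tosses of a fair coin, with $A$ = "first toss is a head", $C$ = "second toss is a head", and $B$ = "both tosses agree". Each of $A$ and $B$ is independent of $C$, yet knowing $C$ changes the conditional probability of $B$ given $A$.

```python
from fractions import Fraction
from itertools import product

# Two fair tosses: four equally likely outcomes, 1 = head, 0 = tail.
omega = list(product((0, 1), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda w: event(w) and given(w)) / prob(given)

def A(w): return w[0] == 1        # first toss is a head
def C(w): return w[1] == 1        # second toss is a head
def B(w): return w[0] == w[1]     # both tosses agree

def A_and_C(w): return A(w) and C(w)
def A_and_B(w): return A(w) and B(w)

print("P(B | A)       =", cond(B, A))
print("P(B | A and C) =", cond(B, A_and_C))
```

In this example the analogues of (5.14) and (5.15) hold but (5.16) fails, and the two printed conditional probabilities differ.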


5.5 Extending independence to unions 


This section gives us a chance to play a little more with the general definition 
of independence. 

The following lemma is not surprising, but it states a useful fact.
Justifying Lemma 5.9 is a good exercise.


Lemma 5.9 (Independence using cases). Let $A_1, \ldots, A_n$ be disjoint events
in some probability model. Let $B$ be an event such that $A_i, B$ are independent
for each $i = 1, \ldots, n$. Then $A_1 \cup \ldots \cup A_n$ and $B$ are also independent.


Exercise 5.10. Prove Lemma 5.9.


5.6 Solutions for Chapter 5


Solution (Exercise 5.1). (i) From the definitions, $H_1 \cap H_2 = \{(1,1)\}$.


(ii) We already showed that $H_1 \cap H_2 = \{(1,1)\}$.
Similarly $H_1 \cap H_2^c = \{(1,0)\}$, $H_1^c \cap H_2 = \{(0,1)\}$, and $H_1^c \cap H_2^c = \{(0,0)\}$.
Comparing this with the facts in Example 5.4 gives equation (5.3).


(iii)
\[ \begin{aligned} P(\{(1,1)\}) + P(\{(1,0)\}) + P(\{(0,1)\}) + P(\{(0,0)\}) &= p^2 + pq + qp + q^2 \\ &= p(p+q) + q(p+q) = (p+q)^2 = 1. \end{aligned} \]
Solution (Exercise 5.2). $A = \{(1,1), (1,0)\}$, so $P(A) = P(\{(1,1)\}) +
P(\{(1,0)\}) = p^2 + pq = p(p + q) = p$.

$B = \{(1,1), (0,1)\}$, so $P(B) = P(\{(1,1)\}) + P(\{(0,1)\}) = p^2 + qp =
(p + q)p = p$.
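
The checks in Exercises 5.1 and 5.2 can also be carried out mechanically. The following is a minimal sketch of the two-toss model of equation (5.3) as a dictionary from sample points to probabilities. The helper names `two_toss_model` and `prob`, and the particular value of `p`, are choices made here for illustration.

```python
from fractions import Fraction

def two_toss_model(p):
    """Probabilities of the four sample points, as in equation (5.3)."""
    q = 1 - p
    return {(1, 1): p * p, (1, 0): p * q, (0, 1): q * p, (0, 0): q * q}

def prob(model, event):
    """Probability of an event, given as a set of sample points."""
    return sum(model[w] for w in event)

p = Fraction(2, 3)          # hypothetical probability of a head
model = two_toss_model(p)

H1 = {(1, 1), (1, 0)}       # head on the first toss
H2 = {(1, 1), (0, 1)}       # head on the second toss

print(sum(model.values()))            # total probability
print(prob(model, H1), prob(model, H2))
print(prob(model, H1 & H2), prob(model, H1) * prob(model, H2))
```

Any value of `p` between 0 and 1 can be substituted; exact fractions avoid rounding questions when comparing the two sides of equation (5.2).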


Solution (Exercise 5.3). Let $A$ be the event that the first roll gives an even
number. Let $B$ be the event that the second roll gives a number larger than
four.

We can look at the first roll as a separate experiment. The sample space
has 6 outcomes of equal probability, so $P(A) = 3(1/6) = 1/2$.

Similarly we can look at the second roll as a separate experiment, so
$P(B) = 2(1/6) = 1/3$.

By independence, $P(A \cap B) = P(A)P(B) = (1/2)(1/3) = 1/6$.

This approach is more efficient than the method used to solve Exercise 2.7.
Physically we are sure that the two methods are both valid.


Solution (Exercise 5.4). Let $B_1$ be the event that the first roll gives an
even number, and let $C_1$ be the event that the first roll gives an odd number.

Let $B_2$ be the event that the second roll gives an even number, and let
$C_2$ be the event that the second roll gives an odd number.

Clearly $P(B_1) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$. Similarly $P(C_1) = \frac{1}{2}$, $P(B_2) = \frac{1}{2}$ and
$P(C_2) = \frac{1}{2}$.

The sum of an even number and an odd number is odd. Even plus even
is even, and odd plus odd is even.

Thus

\[ A = (B_1 \cap B_2) \cup (C_1 \cap C_2). \]
Using additivity and independence,

\[ P(A) = P(B_1)P(B_2) + P(C_1)P(C_2) = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{2}. \]
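
Both dice answers can be confirmed by listing the 36 outcomes, which is essentially the slower method that independence lets us avoid. The sketch below assumes a fair die rolled twice with all ordered pairs equally likely; the names `p_53` and `p_54` are just labels for the two exercises.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first roll, second roll).
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in rolls if event(w)), len(rolls))

# Exercise 5.3: first roll even and second roll larger than four.
p_53 = prob(lambda w: w[0] % 2 == 0 and w[1] > 4)

# Exercise 5.4: the sum of the two rolls is even.
p_54 = prob(lambda w: (w[0] + w[1]) % 2 == 0)

print(p_53, p_54)
```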


Solution (Exercise 5.5). Suppose that $P(A) = 0$. Then for any $B$,
$P(A \cap B) \le P(A) = 0$, so $P(A \cap B) = 0 = P(A)P(B)$. Thus by definition $A, B$ are
independent.

Now suppose that $P(A) = 1$. By Exercise $P(A \cap B) = P(B) =
P(B)P(A)$, so by definition $A, B$ are independent.


Solution (Exercise 5.6). Suppose that $A, B$ are disjoint. Then $P(A \cap B) = 0$.
If $A, B$ are also independent then $P(A \cap B) = P(A)P(B)$. Since $0 =
P(A)P(B)$, at least one of the events $A, B$ has zero probability.

If $P(A) = 0$ or $P(B) = 0$, then $A, B$ are independent by Exercise 5.5.

This shows that when $A, B$ are disjoint, then $A, B$ are independent if and
only if $P(A) = 0$ or $P(B) = 0$.


Solution (Exercise 5.7). First let us prove the Substitution Fact. Suppose
that for particular events $D_1, D_2$ we know that $P(D_1 \cap D_2) = P(D_1)P(D_2)$.
We need to show that if we replace $D_1$ by $D_1^c$ in this equation we get another
true equation.

Since $D_2 = (D_1 \cap D_2) \cup (D_1^c \cap D_2)$, and the union is disjoint, we always
have $P(D_2) = P(D_1 \cap D_2) + P(D_1^c \cap D_2)$, so

\[ \begin{aligned} P(D_1^c \cap D_2) &= P(D_2) - P(D_1 \cap D_2) = P(D_2) - P(D_1)P(D_2) \\ &= (1 - P(D_1))P(D_2) = P(D_1^c)P(D_2), \end{aligned} \]

as claimed.

This proves the stated Substitution Fact.

To prove the lemma, we consider some applications of the substitution
operation.

Suppose that equation (5.5) is true. Replacing $A$ by $A^c$ gives
equation (5.7), so this equation must also be true.

On the other hand, suppose that equation (5.7) is true. Replacing $A^c$ by
$(A^c)^c = A$ gives equation (5.5), so equation (5.5) must be true.

We have shown that the truth of either one of equations (5.5) and (5.7)
implies the truth of the other.

Switching between $B$ and $B^c$ shows that equations (5.5) and (5.6) are
equivalent.

Switching between $A$ and $A^c$ shows that equations (5.6) and (5.8) are
equivalent.

Thus we can change any one of the equations into any of the other
equations, using a sequence of substitution operations, and these substitution
operations preserve truth.
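
A small numerical check can make the lemma feel less abstract. The sketch below uses a single roll of a fair die as the probability model, which is an example chosen here and not one from the text. It tests equations (5.5) to (5.8) for one independent pair and for one dependent pair. In each case all four equations should be true together or false together.

```python
from fractions import Fraction

# One roll of a fair die: six equally likely outcomes.
omega = frozenset(range(1, 7))

def prob(event):
    """Probability of an event, given as a set of outcomes."""
    return Fraction(len(event), len(omega))

def independent(e, f):
    """True when P(e ∩ f) = P(e) P(f)."""
    return prob(e & f) == prob(e) * prob(f)

def four_checks(e, f):
    """Truth values of equations (5.5), (5.6), (5.7), (5.8) for the pair e, f."""
    ec, fc = omega - e, omega - f
    return [independent(e, f), independent(e, fc),
            independent(ec, f), independent(ec, fc)]

A = frozenset({2, 4, 6})   # the roll is even
B = frozenset({1, 2})      # the roll is at most 2 (independent of A)
C = frozenset({1, 2, 3})   # the roll is at most 3 (not independent of A)

print(four_checks(A, B))
print(four_checks(A, C))
```

A check of this kind illustrates the lemma for particular events; it is not a substitute for the proof.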


Solution (Exercise 5.8).
(i) Suppose that $P(B \mid A) = P(B \mid A^c)$.
By the Law of Total Probability (Theorem 4.3),

\[ P(B) = P(A)P(B \mid A) + P(A^c)P(B \mid A^c). \tag{5.18} \]
Hence
\[ P(B) = P(A)P(B \mid A) + P(A^c)P(B \mid A). \tag{5.19} \]
Since $P(A) + P(A^c) = 1$,
\[ P(B) = P(B \mid A). \tag{5.20} \]
By Remark 5.3, $A, B$ are independent.


(ii) We must show that if $A, B$ are independent then equation (5.11) must
hold! (Does it make sense that this formulation is equivalent to what is asked
in part (ii) of the question? It certainly will if you think about it. We could
talk here about the “contrapositive” form of a statement, but we don’t need
no logic definitions to figure this out.)

Since $A, B$ are independent, Remark 5.3 tells us that

\[ P(B \mid A) = P(B). \]

Also, by Lemma 5.6, if $A, B$ are independent then also $A^c, B$ are independent.
So we can replace $A$ by $A^c$ in the equation just obtained. This gives

\[ P(B \mid A^c) = P(B), \]
so $P(B \mid A^c) = P(B \mid A)$.
Solution (Exercise 5.9).


Proof. By the conditional probability formula,

\[ P(B \mid A \cap C) = \frac{P(B \cap A \cap C)}{P(A \cap C)} = \frac{P(A \cap B)P(C)}{P(A)P(C)} = \frac{P(A \cap B)}{P(A)} = P(B \mid A), \]
proving equation (5.17).


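Solution (Exercise 5.10).
\[ P(B \cap (A_1 \cup \ldots \cup A_n)) = P((B \cap A_1) \cup \ldots \cup (B \cap A_n)). \]

By assumption, $A_1, \ldots, A_n$ are disjoint, so $B \cap A_1, \ldots, B \cap A_n$ are disjoint.
Hence

\[ \begin{aligned} P(B \cap (A_1 \cup \ldots \cup A_n)) &= P(B \cap A_1) + \ldots + P(B \cap A_n) \\ &= P(B)P(A_1) + \ldots + P(B)P(A_n) \\ &= P(B)\,(P(A_1) + \ldots + P(A_n)) = P(B)P(A_1 \cup \ldots \cup A_n). \end{aligned} \]
By definition, this shows that $B$ and $A_1 \cup \ldots \cup A_n$ are independent.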
Tricky little problems


Sometimes very simple problems are enlightening, for example if they illus- 
trate the need to be careful when setting up an abstract model of a real-world 
setting. In this short chapter we’ll work through two well-known examples. 


6.1 One or two successes 


The happy Sam problem A local charity has a booth at the fair. This 
booth offers donors the opportunity to play a game. In this game, your 
chance of winning is p, and if you win, you receive a small prize. 

Let us think about a donor named Sam, who visits the fair one afternoon. 
He plays the game exactly twice during the afternoon. Sam is easy to please, 
so he is happy if he wins at least one prize. If he does not win any prize, he 
is unhappy. 

Let A be the event that Sam wins both times he plays the game. We 
assume that the results of the two games are independent, so the probability 
of A is p?. This is what we will call the unconditional probability of A in 
this setting. 

His friends meet Sam some time after he leaves the fair. They know that 
Sam played two games at the fair, but they do not know the results of the 
two games. However, they observe that Sam is happy. Thus his friends know 
that Sam won at least one game at the fair. Based on their information 
about Sam, what probability should his friends assign to A? 

We can use a sample space for Sam’s games which is similar to the space
for two coin tosses (Example 5.4). Let $\Omega = \{(W,W), (W,L), (L,W), (L,L)\}$.
Sam’s friends want to know $P(A \mid B)$, where the abstract event $B$ is given
by $B = \{(W,W), (W,L), (L,W)\}$. The abstract event $A$ for this sample space
is of course $\{(W,W)\}$.

Notice that $A \subset B$, so $A \cap B = A$. Let $q = 1 - p$.

Since $P(B) = p^2 + 2pq$, the conditional probability formula tells us
that

\[ P(A \mid B) = \frac{p^2}{p^2 + 2pq} = \frac{p}{p + 2q} = \frac{p}{1 + q}. \]
When $p = 1/2$, this says that the probability that Sam won both games is
only 1/3.
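
The happy Sam calculation can be reproduced directly from the sample space. This is a minimal sketch, assuming the four-point model above with independent games. The function name `p_both_given_at_least_one` is invented here for illustration.

```python
from fractions import Fraction

def p_both_given_at_least_one(p):
    """P(A | B) for the happy Sam problem, computed from the sample space."""
    q = 1 - p
    # Probabilities of the four sample points, using independence of the games.
    model = {("W", "W"): p * p, ("W", "L"): p * q,
             ("L", "W"): q * p, ("L", "L"): q * q}
    A = {("W", "W")}                          # Sam wins both games
    B = {("W", "W"), ("W", "L"), ("L", "W")}  # Sam wins at least one game
    p_B = sum(model[w] for w in B)
    p_A_and_B = sum(model[w] for w in A & B)
    return p_A_and_B / p_B

print(p_both_given_at_least_one(Fraction(1, 2)))
print(p_both_given_at_least_one(Fraction(1, 4)))
```

For any $p$ the result should agree with the closed form $p/(1+q)$ obtained above.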
For comparison, consider a different problem about Sam’s games. 


The Sam’s witness problem The afternoon of the fair, you are strolling 
through the fair, and you happen to pass by the charity booth at a moment 
when Sam is playing one of his two games. You observe that Sam wins. 
You don’t see Sam again that day. 
Like Sam’s friends, you know that the event B has occurred. However, 
you also have some additional information. What probability should you 
assign to A? 


Exercise 6.1. Solve the Sam’s witness problem. 


The difference between these two problems about Sam may be evident, 
but in many problems such differences may be obscured by the wording. 
Here is an example of a loosely phrased version of the happy Sam problem: 
“Sam played two games at the fair, and he won at least one game. Find the 
probability that he won the other game.” 

A common variant of this problem: “A couple has two children. Given 
that one of the children is a girl, find the probability that the other child is 
a girl.” 


6.2 The Monty Hall problem


This is well-known, but worth reviewing. A good history of this problem is 
given in [10]. Apparently some mathematicians refused to believe the correct 
answer. The embarrassing details are given in section 1.10 of [10].

At any rate, here is the problem. It is inspired by a game which was
sometimes played on the television show Let’s Make a Deal, hosted by Monty 
Hall. The idealized version of the game which is described here may not 
match what actually used to happen, so our Monty Hall is not quite the real 
Monty Hall. 

We assume that in this game, three doors are visible to the contestant, 
and the contestant is asked to choose one of these doors. The contestant will 
be awarded whatever prize is concealed behind the selected door. There is 
a valuable prize, perhaps a sports car, behind one door, and something very 
disappointing is behind each of the other two doors. 

Of course we assume that the prize can lie behind any door, with no door 
favored. The contestant has no idea which door has the prize. 

So far, so good. Now comes the twist. We assume that after the contes- 
tant has chosen a door, but before revealing whether the contestant’s guess 
was correct, Monty Hall often opens one of the two doors which were not 
selected by the contestant, always revealing one of the disappointing non- 
prizes when he does so. Then, Monty Hall offers the anxious contestant an 
opportunity to switch his or her choice to the other unopened door. 

The basic question here is whether the contestant would benefit by switch- 
ing. 

We begin by focusing our attention on the original choice by the contes-
tant. Let C be the event that the contestant’s choice is correct. Since no
door was favored in setting up the game, P(C) = 1/3. 

Here P refers to probabilities based on the information we have before 
Monty Hall opens a door. 

But notice that Monty Hall does not physically move the valuable prize. 
So if the contestant’s choice is correct at the moment of the choosing, the 
contestant’s choice is correct for ever. If the choice is wrong, it stays wrong. 
We also know there is only one alternative left after Monty Hall has opened 
a door. Thus, if the contestant’s original choice was wrong, the contestant 
should switch to the other unopened door. 

The contestant will choose correctly approximately 1/3 of the time, and 
incorrectly 2/3 of the time. Hence the policy of “always switching” pays off 
2/3 of the time, while “never switching” pays off 1/3 of the time. So switch! 
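
For readers who like to double-check by enumeration, the following Python sketch adds up the exact probabilities for the two policies. It rests on two assumptions that are not part of the problem statement: the contestant picks a door uniformly at random, and Monty picks uniformly when two doors are available to him. Neither assumption changes the result for these two policies. The name `win_probability` is hypothetical.

```python
from fractions import Fraction

DOORS = (0, 1, 2)

def win_probability(switch):
    """Exact chance of winning when Monty always opens a non-chosen, non-prize door."""
    total = Fraction(0)
    for prize in DOORS:
        for choice in DOORS:
            weight = Fraction(1, 9)     # prize and choice: independent, uniform
            openable = [d for d in DOORS if d not in (choice, prize)]
            for opened in openable:
                # assumption: Monty chooses uniformly among the doors he may open
                branch = weight / len(openable)
                if switch:
                    final = next(d for d in DOORS if d not in (choice, opened))
                else:
                    final = choice
                if final == prize:
                    total += branch
    return total

print(win_probability(switch=True), win_probability(switch=False))
```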

That answers the basic question. But there seems to be something about 
the Monty Hall problem that makes people doubt the answer. It is not com- 
pletely clear why. Monty Hall’s actions do complicate the problem. But 
sloppy wording of the problem can also cause trouble, if Monty Hall’s proce-
dure is not explained precisely.

Consider the following variation of the Monty Hall problem. 


Example 6.1 (The defective door problem). Suppose Monty Hall is on 
vacation. In the absence of a skilled host, the manager of the game show 
decides that they can only provide a simplified version of the game. The 
contestant will choose a door, and will then be given whatever prize lies 
behind the door. 

However, fate is about to intervene. 

After the contestant has chosen a door, one of the other doors suddenly 
swings open. The door must have been defective in some way, although no 
one knew about this until now. Perhaps a vibration in the floor, or a gust of 
wind, has now made the door open. 

We think that the location of the prize does not affect the condition of
any of the doors, and the location of the prize cannot cause any door to open, 
or not open. However, it happens that the door which opened is not the one 
concealing the valuable prize. So the contestant’s door and one other door 
remain unopened, and we know that one of them hides the valuable prize. 

The manager of the game show notices that this accident has presented 
the audience with exactly the same situation that would have been the result 
of Monty Hall’s usual antics. To live up to the expectations of the audience, 
the manager offers the contestant the opportunity of switching his or her 
choice to the other unopened door. 

We ask the same question as before. Does the contestant benefit by 
switching? 


Exercise 6.2. Please solve the defective door problem. 


Example 6.1 seems more natural than the Monty Hall problem, and some-
times people may solve the wrong problem.


Exercise 6.3 (Mega-Monty). In order to convince people that switching 
is the right policy for the standard Monty Hall problem, the following varia- 
tion is sometimes presented. An argument which is claimed to work for the 
standard Monty Hall problem can be “stress-tested” on this version of the
problem. 

Suppose that for a special edition of the game show, a long hallway is used, 
with 100 doors. The prize is behind one of the doors, and the miserable 
contestant must choose one door. After the choice has been made, in a 
surprising act of generosity Monty Hall opens 98 of the remaining doors, 
none of which have prizes, and then offers the contestant a chance to switch 
his or her choice to the other unopened door. 

Again we ask, does the contestant benefit by switching? 


Exercise 6.4 (Monty with a tell). 


Part 1 Returning to the Monty Hall game with the usual three doors, 
imagine you have been invited to appear as a contestant. The game will be 
hosted by Monty’s sister, Ivy Hall, who sometimes replaces Monty. 

You prepare by carefully watching recordings of all previous shows for 
which Ivy was the host. By the end of each show the audience knows where 
the prize was located for that show, so that information is in the recording. 

The three doors for this contest are arranged in a line going from left to 
right. You excitedly notice the following behavior pattern. Whenever the 
door originally chosen by the contestant is the door with the valuable prize, 
so that Ivy Hall can choose which of the two remaining doors to open, Ivy 
always opens the remaining door on the left. 

Eventually you appear on the game show, and select your door. And then 
Ivy Hall opens ...the remaining door on the left! 

At this point Ivy Hall offers you the usual opportunity to switch your 
choice. As you stand there, weighing your chances, Ivy notices your indeci- 
sion, and makes an unusual extra offer. She will pay you an additional $100, 
win or lose, if you do not switch. 

What should you do? 


Part 2 Suppose the same situation arises as in Part 1, except that in this 
case you observe that Ivy Hall has opened the remaining door on the right. 
Everything else is the same. 

What should you do? 

6.3 Solutions for Chapter 6


Solution (Exercise 6.1). The witness doesn’t just know that Sam won
one game; the witness can also specify which game it was, namely the game
that was played while the witness walked by. 

The other game Sam played is well-defined, and the result of that game is 
of course independent of anything that happens in the game that the witness 
saw. 

Sam won the game that the witness saw, so the probability that Sam won 
both games is just the probability that Sam won the other game, and this
is p.


Solution (Exercise 6.2). Notice that the choice made by the contestant
has no effect on where the prize is, and has no influence on the opening of the 
defective door either. This is in contrast to the situation when Monty Hall 
opened a door, since Monty never opens the door chosen by the contestant. 

Monty is also careful to avoid revealing the valuable prize. 

Sometimes arguments in physical situations seem easier to understand 
when phrased in terms of frequencies. Readers may wish to try that. But 
we'll mostly stick to probabilities here. 


Method 1 The idea of the solution is easy to state: we will use symmetry. 

After noting that the choice of a door by the contestant has no effect on 
anything else, we can ignore the contestant, and simply look at the doors. 
After the defective door has opened, we see two remaining unopened doors. 
One of these doors has the valuable prize. Nothing in the description of the 
problem (except the contestant’s choice) treats either of these doors differ- 
ently. So the valuable prize is equally likely to reside behind either door. 

There is no reason to switch. 

Now we will give a more formal version of the same argument. 

Let’s label the doors as door 1, door 2 and door 3, where door 1 is the 
defective door. 

Let $V_i$ be the event that the valuable prize is located behind door $i$.

Let P denote the probability distribution based on all the information
available to the contestant before the door opens. This includes the knowledge
of which door the contestant chose, but that bit of knowledge has no relevance
to $P(V_i)$, since the choice of a door by the contestant has no effect on
the placing of the valuable prize.

As usual, $P(V_i) = 1/3$ for all $i$.

Let $W_1$ be the event that door 1 was defective, and it opened. That is
one piece of new information that the contestant gains when the door opens.

The contestant then has a chance to see what was behind the door, and
so gains an additional piece of new information, namely that the prize is not
there. Thus the contestant also learns that $V_1^c$ has occurred.

So the total new information gained by the contestant is $V_1^c \cap W_1$.

Let $M = V_1^c \cap W_1$.

Since $V_2 \subset V_1^c$, we have $V_2 \cap V_1^c = V_2$. So by the multiplied-through
version of the conditional probability formula (equation (4.2)),

$$P(M \cap V_2) = P(V_1^c \cap W_1 \cap V_2) = P(W_1 \cap V_2) = P(W_1 \mid V_2)\,P(V_2).$$

Similarly

$$P(M \cap V_3) = P(V_1^c \cap W_1 \cap V_3) = P(W_1 \cap V_3) = P(W_1 \mid V_3)\,P(V_3).$$


The location of the prize does not affect the condition of any of the doors,
and the location of the prize cannot cause any door to open, or not open.
So

$$P(W_1 \mid V_2) = P(W_1) = P(W_1 \mid V_3).$$

Since also $P(V_2) = P(V_3) = 1/3$, it follows that

$$P(M \cap V_2) = P(M \cap V_3).$$


Notice that in frequency language, this equation says that contestants will
find themselves in situation M just as often when the valuable prize is behind
door 2 as when it is behind door 3.

In probability language, we have

$$P(V_2 \mid M) = \frac{P(M \cap V_2)}{P(M)} = \frac{P(M \cap V_3)}{P(M)} = P(V_3 \mid M).$$


Since 


P(V2|M) = P(V3|M), 


the contestant has no reason to switch. 
That finishes the solution. 

Method 2


If you include the contestant’s choice of a door in the discussion, you might 
say something like the following. 

From the contestant’s viewpoint, the opening of the door is a random 
event, independent of everything else. The chance of any particular door 
opening is the same, a small probability. 

Let C be the event that the contestant’s original choice of a door was 
correct, meaning that it is the one with the valuable prize. 

As usual $P(C) = 1/3$, so $P(C^c) = 2/3 = 2P(C)$.

Let M be the event which describes the new situation after the door 
opened. When deciding whether or not to switch, the contestant should be 
interested in $P(C \mid M)$.

The key idea in this approach: we are dealing with the situation in which 
the door that opened was neither the door with the prize nor the door picked 
by the contestant. Common sense probability tells us that this is twice as 
likely to happen if the contestant chose correctly, since then the door with 
the valuable prize and the door picked by the contestant are one and the 
same door, and there are two possible choices for the defective door. 

Using our notation, $P(M \mid C) = 2P(M \mid C^c)$.

The multiplied-through version of the conditional probability formula
(equation (4.2)) tells us that

$$P(M \cap C) = P(C)\,P(M \mid C) \quad\text{and}\quad P(M \cap C^c) = P(C^c)\,P(M \mid C^c).$$

The fact that $P(M \mid C) = 2P(M \mid C^c)$ exactly compensates for the fact that
$P(C^c) = 2P(C)$, and we have

$$P(M \cap C) = P(M \cap C^c).$$

In frequency language, this equation says that contestants will find them-
selves in situation M just as often when the chosen door is correct as when
it is incorrect.

And

$$P(C \mid M) = \frac{P(M \cap C)}{P(M)} = \frac{P(M \cap C^c)}{P(M)} = P(C^c \mid M).$$


Thus the chance of getting the valuable prize in this situation is the same, 
whether or not you switch, and there is no reason to switch. 
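
The same conclusion can be checked by listing cases. The Python sketch below follows the modelling of Method 2: the prize location, the contestant's choice and the door that swings open are treated as independent and uniform over the three doors (that uniformity is an assumption of the sketch), and we then condition on the observed situation M. The function name `defective_door_posterior` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def defective_door_posterior():
    """P(original choice is correct | a non-chosen, non-prize door swung open)."""
    doors = range(3)
    weight = Fraction(1, 27)    # prize, choice, opened door: independent, uniform
    prob_m = Fraction(0)
    prob_c_and_m = Fraction(0)
    for prize, choice, opened in product(doors, repeat=3):
        if opened in (prize, choice):
            continue            # this case is not the observed situation M
        prob_m += weight
        if choice == prize:
            prob_c_and_m += weight
    return prob_c_and_m / prob_m

print(defective_door_posterior())
```

Compare the cases counted here with the standard Monty Hall game, where the opened door is never picked at random.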

Solution (Exercise 6.3). Now the probability that the original guess was
correct is 1/100. This happens approximately 1/100 of the time. And so 
switching brings success approximately 99 times out of 100. 

Switch! 


Solution (Exercise 6.4). Part 1 Let L be the event that Ivy Hall opens
the remaining door on the left. 

Let C be the event that the contestant picked the correct door. From
your study of past Ivy Hall shows, you know that $P(L \mid C) = 1$.

If $C^c$ occurs, then the prize is behind one of the two doors which the con-
testant did not pick. Knowing that $C^c$ occurred does not give us information
which favors either of those two doors.

If $C^c$ occurs, Ivy Hall must of course open the remaining door which does
not have the prize, wherever it is, left or right. Thus $P(L \mid C^c) = 1/2$.

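By the Law of Total Probability (Theorem 4.3),

$$P(L) = P(C)\,P(L \mid C) + P(C^c)\,P(L \mid C^c) = \frac{1}{3}\cdot 1 + \frac{2}{3}\cdot\frac{1}{2} = \frac{2}{3},$$

so

$$P(C \mid L) = \frac{P(C \cap L)}{P(L)} = \frac{P(C)\,P(L \mid C)}{P(L)} = \frac{\frac{1}{3}\cdot 1}{\frac{2}{3}} = \frac{1}{2}.$$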


Since switching does not improve your chances of winning the valuable prize, 
stick with your choice and take the $100. 
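
The Part 1 arithmetic is easy to replay in code. This Python sketch applies the Law of Total Probability and then the conditional probability formula, with the host's habit left as a parameter. The function and parameter names are invented for the illustration. Setting the parameter to 1/2 describes a host with no tell, which is an extra case added here only for comparison.

```python
from fractions import Fraction

def posterior_correct_given_left(p_left_if_correct):
    """P(C | L): chance the original choice is correct, given the left door opened."""
    p_c = Fraction(1, 3)                  # original choice correct
    p_not_c = 1 - p_c
    p_left_if_wrong = Fraction(1, 2)      # the forced door is left or right equally often
    # Law of Total Probability
    p_left = p_c * p_left_if_correct + p_not_c * p_left_if_wrong
    # conditional probability formula
    return p_c * p_left_if_correct / p_left

print(posterior_correct_given_left(Fraction(1)))       # Ivy: always left when she can choose
print(posterior_correct_given_left(Fraction(1, 2)))    # a host who chooses at random
```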


Part 2 If the door you chose had been correct, Ivy would have opened the
remaining door on the left. She didn’t do that, so your original choice must
be wrong, and the prize is behind the other unopened door.
Switch!

Independent sequences


The most important facts in this chapter are Definitions 7.1 and 7.7 and the
formulas in Section 

We've discussed independence for two events. But often we want to con- 
sider more than two events, perhaps even a long sequence of events. 


7.1 Sequences of experiments 


Consider a sequence of n experiments. We can call each repetition of the 
experiment a “trial”. 


Definition 7.1 (Independent trials). We will say that a sequence of n 
experiments is independent if, for each $k < n$, information about the results
for trials $1, \dots, k$ does not change our opinion about the probability of any
properties of the result of trial k + 1. 

For each $i$, let $D_i$ be a physical event that is defined completely by the
outcome of the i-th experiment. If the sequence of n experiments is inde-
pendent then we will say that the sequence $D_1, \dots, D_n$ is an independent
sequence of physical events.


Suppose that, based on our experience, we think that a certain sequence 
of experiments is independent in the sense of Definition 7.1.

Let $D_i$ be an event that is defined completely by the outcome of the i-th
experiment. Let V be an event which is entirely determined by the results
of trials $1, \dots, k$. By that definition, knowing that V occurs does not affect
our opinion about whether or not $D_{k+1}$ occurs. That is,

$$P(D_{k+1} \mid V) = P(D_{k+1}). \tag{7.1}$$


The independence of the sequence means that equation (7.1) should hold for 
any event V which is “entirely determined by the results of
trials 1,...,k”. 

An example of such an event V is the event $D_1 \cap \dots \cap D_k$. So let’s take
$V = D_1 \cap \dots \cap D_k$ in equation (7.1). This gives:

$$P(D_{k+1} \mid D_1 \cap \dots \cap D_k) = P(D_{k+1}). \tag{7.2}$$

Then the multiplied-through version of the conditional probability formula (equa-
tion (4.2)) says that

$$P(D_1 \cap \dots \cap D_k \cap D_{k+1}) = P(D_1 \cap \dots \cap D_k)\,P(D_{k+1}). \tag{7.3}$$

If you set $k = 1$ in equation (7.3), you see that equation (7.3) simply says
that $D_1, D_2$ are independent. Indeed, that’s how we arrived at the definition
of mathematical independence in Section 

Please do the next exercise! 


Exercise 7.1. Let $D_1, \dots, D_n$ be an independent sequence of physical events,
as described above. Show that for every $k \le n$,

$$P(D_1 \cap \dots \cap D_k) = P(D_1) \cdots P(D_k). \tag{7.4}$$


The calculation in Exercise is similar. 


Definition says what we mean by independence for a sequence of 
physical events. It is not a definition of mathematical independence for an 
abstract model, although of course consequences such as equation (7.4) must 
hold in any valid model for an independent sequence of experiments. We’ll 
think later about making a precise mathematical definition of an independent 
sequence. 

For now, let’s calculate some consequences of Definition 

7.2 Outcome probabilities when tossing a coin n times


In the present section we perform a key computation. 

Consider n tosses of a coin. The coin need not be fair. The probability 
of a head is some number p and the probability of a tail is q = 1 — p. 

We mentioned in Example that when the result of a toss is a head, 
we sometimes say that result is success. Using this sort of language can be 
briefer, and it also makes it a bit easier to adapt our results about coin-tossing 
to other situations which are similar. 

We'll usually record the success or failure of a toss using a number, either 
0 or 1. Success is represented by 1 and failure is represented by 0. 

The record of successes and failures for a whole sequence of n repeated 
tosses is then a sequence $(x_1, \dots, x_n)$, where for each $j$, $x_j$ is either zero or
one. We can call this sequence the “success record”. 

Clearly there are $2^n$ possible success records.

Should we use $(x_1, \dots, x_n)$ as the sample point that records the whole
result of the experiment when the coin is tossed n times? Then our sample 
space will simply be the set of all possible sequences of this sort. 

We certainly can use that sample space, but the probability argument 
may be clearer if we simply talk about events, without committing to a
particular sample space representation. 

For each $j = 1, \dots, n$, let $W_j$ be the event that toss $j$ produced success.
Let $D_j^1 = W_j$ and let $D_j^0 = W_j^c$.

For any sequence $(x_1, \dots, x_n)$ of zeros and ones, $D_1^{x_1} \cap \dots \cap D_n^{x_n}$ is the
event that:


(toss 1 produced $x_1$) and (toss 2 produced $x_2$) and ... and (toss n
produced $x_n$).


Thus $D_1^{x_1} \cap \dots \cap D_n^{x_n}$ is exactly the event that the success record is equal to
$(x_1, \dots, x_n)$.

For coin tosses, the tosses are independent trials. Thus events defined in 
terms of different tosses are physically independent. Thus by equation (7.4),
for any success record $(x_1, \dots, x_n)$ we know that

$$P(D_1^{x_1} \cap \dots \cap D_n^{x_n}) = P(D_1^{x_1}) \cdots P(D_n^{x_n}). \tag{7.5}$$

Equation (7.5) is the key probability fact we need.

We know that $P(D_j^{x_j}) = p$ if $x_j = 1$ and $P(D_j^{x_j}) = q$ if $x_j = 0$. Thus

$$P(D_1^{x_1} \cap \dots \cap D_n^{x_n}) = p^k q^{n-k}, \tag{7.6}$$


where k is the number of indices j such that x; = 1. 
We have proved: 


Lemma 7.2 (Coin toss outcome probabilities). For any sequence $(x_1, \dots, x_n)$
of zeroes and ones, the probability of obtaining exactly that success record
is $p^k q^{n-k}$, where $k$ is the number of successes in the sequence $(x_1, \dots, x_n)$.
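
Here is a small Python sketch of Lemma 7.2 in action. It multiplies the per-toss probabilities one at a time, as in equation (7.5), and compares the product with the closed form from the lemma. The helper name `record_probability` and the sample record are chosen only for illustration.

```python
from fractions import Fraction

def record_probability(record, p):
    """Probability of one success record: multiply p for each 1 and q for each 0."""
    q = 1 - p
    prob = Fraction(1)
    for x in record:
        prob *= p if x == 1 else q
    return prob

p = Fraction(1, 3)
record = (1, 0, 0, 1, 0)
k, n = sum(record), len(record)
# product of per-toss probabilities versus the closed form of Lemma 7.2
print(record_probability(record, p) == p**k * (1 - p)**(n - k))
```

Notice that the answer depends only on how many ones the record contains, not on where they sit.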


Lemma 7.2 lets us calculate a very useful probability, in the next lemma.


Lemma 7.3 (Probability of obtaining k successes). Let $G_k$ be the event
that n tosses produce exactly k successes. Then $G_k$ is the union of the
events $D_1^{x_1} \cap \dots \cap D_n^{x_n}$, over all sequences $(x_1, \dots, x_n)$ which have k successes.
Furthermore,

$$P(G_k) = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}. \tag{7.7}$$


Here 0! is interpreted as 1 in case that k = 0 or k =n. 


Proof. Let m be the number of distinct sequences $(x_1, \dots, x_n)$ which contain
exactly k ones. 
Lemma 7.2 and the additivity of probability tell us that

$$P(G_k) = m\,p^k q^{n-k}. \tag{7.8}$$


The number m depends on n and k. 
Equation (7.8) will give us equation (7.7), once we show that

$$m = \frac{n!}{k!\,(n-k)!}. \tag{7.9}$$

How do we show that this equation holds? This is a counting problem.

We could explain right now how to count the number of sequences con-
taining exactly k ones. But it seems more efficient to do that in Lemma 8.2, as
part of a general discussion. So we will leave equation (7.9) as an I.O.U. for
the moment.
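
While the I.O.U. is outstanding, you can at least test equation (7.9) numerically for small n. The Python sketch below lists every sequence of zeros and ones and counts those with exactly k ones. This is only a spot check for one value of n, not a proof, and the name `count_records` is made up for the illustration.

```python
from itertools import product
from math import factorial

def count_records(n, k):
    """Count 0/1 sequences of length n with exactly k ones, by listing them all."""
    return sum(1 for record in product((0, 1), repeat=n) if sum(record) == k)

n = 6
for k in range(n + 1):
    formula = factorial(n) // (factorial(k) * factorial(n - k))
    print(k, count_records(n, k), formula)    # the last two columns should agree
```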

Exercise 7.2. Equation (7.9) tells us that $m = n$ when $k = 1$. Verify in this
special case that m = n is the correct value. 


Exercise 7.3. Consider the experiment of tossing a fair coin 5 times. Let A 
be the event that the first three tosses produce at most 1 head in total. Let 
B be the event that the last two tosses produce exactly 1 head in total. Find 
$P(A \cap B)$.


Exercise 7.4. Consider n tosses of a fair coin. 

Let A be any event on your sample space $\Omega$. Prove that P(A) can be
written as a fraction whose denominator is a power of 2. 

Exercise considered methods of simulating an experiment with three 
equally likely outcomes. In such an experiment, each possible outcome must 
have probability 1/3. The present exercise shows that tossing a fair coin can 
never perfectly simulate such an experiment. 


Exercise 7.5 (Counting sequences). Consider tossing a coin 30 times. 
Let $D_i^1$ denote the event that toss $i$ produces a head and let $D_i^0$ denote
the event that toss $i$ produces a tail.
Using the sequence model for the experiment of 30 coin tosses, please 
answer the following questions. 


(i) How many sample points are there in Dz? 
(ii) How many sample points are there in $D_1^1 \cap D_2^0$?
(iii) List all the sample points in $D_1^1 \cap D_2^0 \cap D_3^1 \cap D_4^0 \cap \dots \cap D_{29}^1 \cap D_{30}^0$.

7.3 Bernoulli trials terminology


Like coin-tossing, many experimental situations involve repeated indepen- 
dent experiments, each of which either results in an event called “success” 
or an event called “failure”. The next definition provides a convenient name 
for such experiments. 


Definition 7.4 (Bernoulli trials). Let $W_1, \dots, W_n$ be independent events,
each of which has the same probability p. We will say that the sequence
$W_1, \dots, W_n$ forms a sequence of Bernoulli trials. We will often refer to the
occurrence of $W_j$ as success on trial $j$, and the occurrence of $W_j^c$ as failure
on trial $j$. The probability p will be called the probability of success.

We also speak of the experiments and models associated with the events
$W_1, \dots, W_n$ as Bernoulli trials.

Tossing a coin n times, when the probability of a head is p, gives an 
example of Bernoulli trials, provided we interpret a head as “success” on 
any toss. The event $W_j$ in this situation is simply the event that a head is
obtained on toss $j$.

Any mathematical statement about a Bernoulli trial sequence can be 
translated into a mathematical statement about a coin-tossing sequence with 
the same success probability. Thus we are free to use either Bernoulli trial 
language or coin-tossing language to describe the relevant concepts. 


Translating Lemma into the language of Bernoulli trials gives the 
following. 


Theorem 7.5 (Probability of k successes). Let $W_1, \dots, W_n$ be a sequence
of Bernoulli trials with success probability p, and let $q = 1 - p$.

Let P be the appropriate probability for the model. Let $G_k$ be the event
that exactly k successes occur. Then

$$P(G_k) = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}, \tag{7.10}$$

where 0! is interpreted as 1 in case that k = 0 or k = n.


One often writes $\frac{n!}{k!\,(n-k)!}$ as $\binom{n}{k}$, so equation (7.10) can also be written as

$$P(G_k) = \binom{n}{k} p^k q^{n-k}, \tag{7.11}$$

where

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$

The expression $\binom{n}{k}$ is called a “binomial coefficient”, since it appears in the
statement of the Binomial Theorem (equation (8.5)).


Figure 7.1: P(k heads) in 30 tosses, success prob 1/3.


Definition 7.6 (The binomial distribution). The distribution given by 
equation (7.10) is called the binomial distribution with parameter p. 


Figure 7.1 shows a plot of $P(G_k)$ versus $k$ for 30 trials, with success
probability 1/3. Notice that the graph appears to be centered around k = 10, 
although it is not symmetric around that point. Since the success probability 
is 1/3, the average number of successes over many repetitions of a sequence 
of 30 trials is also equal to 10, since (1/3) * 30 = 10. 

Also notice that those probabilities in Figure 7.1 get awfully small when
you move a moderate distance away from 10. 
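
If you want the numbers behind the plot, the following Python sketch tabulates the binomial probabilities for 30 trials with success probability 1/3, using exact fractions. It uses `math.comb` for the binomial coefficient, which needs Python 3.8 or later. The name `binomial_pmf` is simply a label chosen here.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """List of P(G_k) for k = 0, ..., n, computed from the binomial formula."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

pmf = binomial_pmf(30, Fraction(1, 3))
print(sum(pmf))                                        # total probability
print(max(range(31), key=lambda k: pmf[k]))            # most likely number of successes
print(sum(k * prob for k, prob in enumerate(pmf)))     # average number of successes
print(float(pmf[9]), float(pmf[11]))                   # not symmetric around the peak
print(float(pmf[20]))                                  # far from the peak
```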

7.4 Mathematical independence for a sequence


Building on our experience with coin-tossing, let’s give a general mathemat- 
ical definition. 


Definition 7.7 (Independence for n abstract events). Let $W_1, \dots, W_n$
be a sequence of events in some probability model. Suppose that for every
choice of events $D_1, \dots, D_n$, where each $D_i$ is either $W_i$ or $W_i^c$, the following
equation holds.

$$P(D_1 \cap \dots \cap D_n) = P(D_1) \cdots P(D_n). \tag{7.12}$$


Then we say that the sequence $W_1, \dots, W_n$ is an independent sequence
of events in the probability model. Be careful to note that independence is a
property of the whole sequence, not of each event $W_i$ by itself! Nevertheless,
for brevity we do often express independence by saying that “the events 
are independent”, rather than saying that the events form an independent 
sequence. 


Notice that equation (7.12) gives us $2^n$ equations, when we substitute for
$D_1, \dots, D_n$ in all possible ways. That’s a lot of equations!
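
For a finite sample space, a computer is quite happy to check all those equations. The Python sketch below tests equation (7.12) for every way of replacing events by their complements. It represents events as sets of sample points and the model as a dictionary from sample points to probabilities. That representation, and the name `is_independent_sequence`, are choices made for this illustration only.

```python
from fractions import Fraction
from itertools import product

def is_independent_sequence(events, prob):
    """Check equation (7.12) for every choice of each event or its complement."""
    omega = set(prob)

    def measure(points):
        return sum((prob[w] for w in points), Fraction(0))

    for flags in product((True, False), repeat=len(events)):
        pieces = [e if keep else omega - e for e, keep in zip(events, flags)]
        intersection = set(omega)
        rhs = Fraction(1)
        for piece in pieces:
            intersection &= piece
            rhs *= measure(piece)
        if measure(intersection) != rhs:
            return False
    return True

# three tosses of a coin whose probability of a head is 1/3
p = Fraction(1, 3)
prob = {w: p**sum(w) * (1 - p)**(3 - sum(w)) for w in product((0, 1), repeat=3)}
heads = [{w for w in prob if w[j] == 1} for j in range(3)]
print(is_independent_sequence(heads, prob))
```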

Why should we think that equation is a reasonable definition? 

Well, suppose someone gives you a sequence of independent physical 
trials, in the sense of Definition and you are given physical events 
$W_1, \dots, W_n$ that are such that each $W_i$ is defined entirely in terms of tri-
als $1, \dots, i$.

Suppose that $W_i$ is the abstract event in the model which represents the physical event $W_i$.
Then for every sequence $D_1, \dots, D_n$ of the sort described in the definition,
where each $D_i$ is either $W_i$ or $W_i^c$, it is certainly true that each $D_i$ is defined
entirely in terms of trials $1, \dots, i$. So, by Exercise 7.1, we must believe that
equation holds. 

But is that enough for a definition? Perhaps physically independent 
events have more properties, which are not captured by those $2^n$ equations
given in equation (7.12). Should we worry about that? A reassuring answer
is given by the fact that the $2^n$ events $D_1 \cap \dots \cap D_n$ cover all the possible
cases of what can happen in n tosses. In other words, anything that can
be said about the outcomes can be expressed in terms of set operations on
events of the form $D_1 \cap \dots \cap D_n$.




So we are satisfied that Definition is sound. But is it beautiful? It 
is a lot of equations, isn’t it? On the other hand, all $2^n$ equations follow the
same pattern. So it’s not too bad. 

There is one exceptional case: for n = 2, Lemma shows that the 
single equation $P(W_1 \cap W_2) = P(W_1)P(W_2)$ implies all four of the equations
obtained by substituting in equation (7.12). 

This shows that the sequence $W_1, W_2$ is independent in the sense of Def-
inition 7.7 if and only if $W_1, W_2$ are independent in the sense of Definition 5.2.
That is good, since it avoids ambiguity when we use the word “independent”. 

It’s too bad things are more complicated when n is greater than 2. 

But don’t worry! You will usually find that your physical understanding 
of independent events will let you guess the correct equations for any practical 
problem. (We used this approach in Method 1 for the solution of Exercise 7.3.
And of course that’s how our whole discussion of independent sequences got
started, leading us to equation (7.1).) Physical reasoning lets us go directly
to the calculations we need for independent sequences, although it is not 
sufficient for a general proof. 

Incidentally, when we cover independent random variables in Chapter 12,
you will see a neater way to describe mathematical independence. 


Remark 7.8 (Order does not matter for independence of sequences). 
Note that if $D_1, \dots, D_n$ is an independent sequence as defined
above, then any reordering of the sequence is also independent. This is
true because the intersection of sets does not depend on the order in which 
they are listed, and the product of their probabilities is also the same regard- 
less of the order of the factors. 


Your physical understanding of independence will make you confident 
that the next exercise is correct. But working out the solution is a good way 
to get a feeling for the mathematical definition. 


Exercise 7.6. Let A, B, C be three sets which are mathematically inde-
pendent in the sense of the definition above. Based only on the mathematical
definition, prove the following.


(i) Show that A, B are independent. (Suggestion: consider $A \cap B \cap C$ and
$A \cap B \cap C^c$.)




(ii) Show that $A \cap B$ and C are independent.


Remark 7.9 (The length-one case). When n = 1, we should interpret 
$D_1 \cap \dots \cap D_n$ simply as $D_1$. It follows that any length one sequence satis-
fies the definition. Thus every sequence of length one is a (rather boring)
independent sequence. 

Notice we are not saying that $D_1$ is independent of itself. That would be
a statement about the length-two sequence $D_1, D_1$.


Whether or not you work Exercise please be aware of the danger it 
points out. 


Exercise 7.7 (Pairwise independence is not enough). Here’s an im- 
portant observation that comes up once in a while. For three events A, B,C, 
suppose that you know all possible pairwise independence statements hold, 
i.e. 


e A, B is an independent pair, and 
e B,C is an independent pair, and 
e A,C is an independent pair. 


You still cannot be sure that A, B,C is an independent sequence. 

Here’s an example. Consider tossing a fair coin twice. Let A be the event 
that the first toss produces a head, and let B be the event that the second
toss produces a head. Let C be the event that the results of the two tosses
agree, that is, $C = (A \cap B) \cup (A^c \cap B^c)$.

The statement of the experiment tells us that A,B are independent. 

Show: 


(i) that also B, C are independent and A, C are independent, but


(ii) $P(A \cap B \cap C) \ne P(A)P(B)P(C)$.

Thus A, B, C is not an independent sequence.

(See Figure 5.1. The event C is the union of two of the four pieces shown
in part (c) of the figure. Notice that each of the four pieces in part (c) has 
probability 1/4, so calculations should not be hard.) 
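
Once you have done the calculation by hand, you may like to confirm it by machine. This Python sketch lists the four equally likely outcomes of two fair tosses and tests the pairwise equations and the triple product equation. The helper names `prob` and `both` are invented for the illustration.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product("HT", repeat=2))    # four equally likely outcomes

def prob(event):
    """Probability of an event given as a true/false test on outcomes."""
    return Fraction(sum(1 for w in OUTCOMES if event(w)), len(OUTCOMES))

def both(e, f):
    """The intersection of two events."""
    return lambda w: e(w) and f(w)

A = lambda w: w[0] == "H"       # first toss is a head
B = lambda w: w[1] == "H"       # second toss is a head
C = lambda w: w[0] == w[1]      # the two tosses agree

pairwise = all(prob(both(e, f)) == prob(e) * prob(f)
               for e, f in [(A, B), (B, C), (A, C)])
triple = prob(both(A, both(B, C))) == prob(A) * prob(B) * prob(C)
print(pairwise, triple)
```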


7.5 Thinking about consistency again 


In the case of two tosses of a coin, it was mentioned earlier that our physical 
experience makes us confident that probability models for two tosses are 
consistent with models for one toss. And we checked that in Exercise 

Let’s look now at more general sequences of tosses. As in Example 2.6,
suppose you are studying tossing a coin 1,000,000 times, and using the
sample space consisting of sequences $(x_1, \dots, x_{1000000})$, where each $x_i$ is either
1 or 0.

In this section we will check a couple of things. 

First, let’s check that the probabilities of the outcomes add up to one. 

The following notation is handy. 

Let $\theta(1) = p$ and let $\theta(0) = 1 - p$. Notice, by the way, that $\theta(1) + \theta(0) = 1$.

Using the $\theta$ notation, equation (7.6) can be written neatly as

$$P(\{(x_1, \ldots, x_n)\}) = \theta(x_1) \cdots \theta(x_n). \tag{7.13}$$

If we want to show that the probabilities of the outcomes add up to one,
we must show that

$$\sum_{x_1, \ldots, x_n} \theta(x_1) \cdots \theta(x_n) = 1,$$

where the sum in this equation is over all possible values for $x_1, \ldots, x_n$, and
each $x_i$ can be 1 or 0.

We have to do something with that big sum on the left side of the equa- 
tion. 

Using the distributive law as much as possible, we see that

$$\underbrace{(\theta(1) + \theta(0)) \cdots (\theta(1) + \theta(0))}_{n \text{ factors}} = \sum_{x_1, \ldots, x_n} \theta(x_1) \cdots \theta(x_n).$$

Since $\theta(1) + \theta(0) = 1$,

$$\sum_{x_1, \ldots, x_n} \theta(x_1) \cdots \theta(x_n) = \underbrace{1 \times \cdots \times 1}_{n \text{ factors}} = 1,$$

so the probabilities of the outcomes do indeed add up to one!
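The distributive-law argument can also be checked by brute force for small numbers of tosses. The following is only a sketch: the function names are made up for illustration, the value 0.3 for the success probability is an arbitrary choice, and the sum is evaluated in floating point, so it equals one only up to rounding.

```python
from itertools import product


def outcome_probability(outcome, p):
    """Product theta(x_1)...theta(x_n), with theta(1) = p and theta(0) = 1 - p."""
    prob = 1.0
    for x in outcome:
        prob *= p if x == 1 else 1.0 - p
    return prob


def total_probability(n, p):
    """Add the probabilities of all 2**n sequences of zeros and ones."""
    return sum(outcome_probability(w, p) for w in product((0, 1), repeat=n))


# Small illustrative cases only; a million tosses is far too many to enumerate.
for n in (1, 2, 5, 10):
    print(n, total_probability(n, 0.3))
```

Enumeration takes $2^n$ steps, which is exactly why the algebraic argument is needed for a million tosses.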
Here’s another exercise in checking. 


Exercise 7.8 (Consistency). You’ve tackled this problem already in the 
case of a fair coin, in Exercise Now consider the general case, when the 
coin is not necessarily fair, and has success probability p. 

Find $P(D_1)$ using the million-toss sample space. Remember, no peeking
at the one-toss space!


7.6 Solutions for Chapter 


Solution (Exercise 7.1). Equation (7.4) and equation (7.3) are identical
when k = 1.


Now we run the usual induction argument. Let the number of trials be
n. Assume that equation (7.4) holds for some $k \geq 1$. If $k < n$, then by
equation (7.3),

$$P(A_1 \cap \cdots \cap A_{k+1}) = P(A_1 \cap \cdots \cap A_k) P(A_{k+1}).$$

Substitute for $P(A_1 \cap \cdots \cap A_k)$ in this equation, using equation (7.4).

The result is equation with k replaced by k + 1, so the induction 
step is verified. 

It follows that equation holds for every k = 1,...,n. 


Solution (Exercise 7.2). Let $(x_1, \ldots, x_n)$ be a sequence of ones and zeroes,
for which exactly one of the numbers $x_i$ is equal to one.

There are n choices for the index $i$ with $x_i = 1$. Hence there are exactly
n sequences of this type.


Solution (Exercise 7.3). Method 1. We consider the first three tosses as
a separate experiment to find P(A). The sample space consists of sequences
of length 3. $A = \{(0,0,0), (1,0,0), (0,1,0), (0,0,1)\}$. Thus $P(A) = 4(1/8) = 1/2$.

We consider the last two tosses as a separate experiment to find P(B).
$B = \{(1,0), (0,1)\}$, so $P(B) = 2(1/4) = 1/2$.

Using our physical understanding of independence, A and B should be
independent, since they depend on separate tosses. Thus $P(A \cap B) = (1/2)(1/2) = 1/4$.

Method 2. The sample space for the five tosses consists of 32 sequences of
zeroes and ones. All have the same probability, so $P(\{\omega\}) = 1/32$ for every
sample point $\omega$.

A consists of all sequences of the form

$$(0,0,0,x_4,x_5) \text{ or } (1,0,0,x_4,x_5) \text{ or } (0,1,0,x_4,x_5) \text{ or } (0,0,1,x_4,x_5),$$

where $x_4, x_5$ can be zero or one. Thus A contains $4 \times 2 \times 2$ points, so
$P(A) = 16/32 = 1/2$.

B consists of all sequences of the form

$$(x_1, x_2, x_3, 1, 0) \text{ or } (x_1, x_2, x_3, 0, 1),$$

where $x_1, x_2, x_3$ can be zero or one. Thus B contains $2 \times 2 \times 2 \times 2$ points, so
$P(B) = 16/32 = 1/2$.

Consider a sample point $(x_1, x_2, x_3, x_4, x_5) \in A \cap B$. There are two possible
cases:


• $x_i = 0$ for all $i = 1, 2, 3$ and $x_i = 1$ for exactly one of the indices $i = 4, 5$.
There are $1 \times 2 = 2$ ways to choose $x_1, x_2, x_3, x_4, x_5$, so there are two
sample points for this case.


• $x_i = 1$ for exactly one of the indices $i = 1, 2, 3$, and $x_i = 1$ for exactly
one of the indices $i = 4, 5$. There are $3 \times 2 = 6$ ways to choose
$x_1, x_2, x_3, x_4, x_5$, so there are six sample points for this case.


Since $A \cap B$ contains 8 sample points, $P(A \cap B) = 8/32 = 1/4$.


Of course, Method 1 is more efficient, and conceptually clearer.
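Method 2 is a counting argument over 32 sample points, so it can be checked by listing them. This is a sketch only. The descriptions of A and B in the comments are read off from the sets listed in the solution (the exercise statement itself is not repeated here), and the helper names are invented for illustration.

```python
from fractions import Fraction
from itertools import product

# All 32 equally likely outcomes of five tosses (1 = head, 0 = tail).
outcomes = list(product((0, 1), repeat=5))


def prob(event):
    """Probability of an event given as a predicate on a sample point."""
    favorable = sum(1 for w in outcomes if event(w))
    return Fraction(favorable, len(outcomes))


def in_A(w):
    # Assumed reading of A: at most one head among the first three tosses.
    return sum(w[:3]) <= 1


def in_B(w):
    # Assumed reading of B: exactly one head among the last two tosses.
    return sum(w[3:]) == 1


p_A = prob(in_A)
p_B = prob(in_B)
p_AB = prob(lambda w: in_A(w) and in_B(w))
print(p_A, p_B, p_AB)  # 1/2 1/2 1/4
```

Exact fractions are used so that the comparison $P(A \cap B) = P(A)P(B)$ is not blurred by rounding.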


Solution (Exercise 7.4). Each outcome has probability $1/2^n$. By Theorem 2.16,
if an event A contains $\ell$ outcomes then $P(A) = \ell/2^n$.


Solution (Exercise 7.5). (i) Let $(x_1, \ldots, x_{30})$ be a sample point in $D_5$.
Then $x_5 = 1$. For each index $i \neq 5$, there are two possible values for $x_i$.
Hence there are $2^{29}$ choices for $(x_1, \ldots, x_{30})$, so $|D_5| = 2^{29}$.


(ii) Let $(x_1, \ldots, x_{30})$ be a sample point in $D_5 \cap D_7^c$.

Then $x_5 = 1$ and $x_7 = 0$. For each index distinct from 5 and 7, there are
two possible values for $x_i$. Hence there are $2^{28}$ choices for $(x_1, \ldots, x_{30})$, so
$|D_5 \cap D_7^c| = 2^{28}$.

(iii) Let $(x_1, \ldots, x_{30})$ be a sample point in $D_1 \cap D_2^c \cap D_3 \cap D_4^c \cap \cdots \cap D_{29} \cap D_{30}^c$.

Then $x_1 = 1$, $x_2 = 0$, $x_3 = 1$, $x_4 = 0$, etc. Thus $x_i = 1$ if $i$ is odd and
$x_i = 0$ if $i$ is even. This is the only sample point in the event.


Solution (Exercise 7.6). (i) From the definition of mathematical independence,
$P(A \cap B \cap C) = P(A)P(B)P(C)$, and also $P(A \cap B \cap C^c) = P(A)P(B)P(C^c)$.

Since $P(C) + P(C^c) = 1$ we have $P(A \cap B \cap C) + P(A \cap B \cap C^c) = P(A)P(B)$.

Also $(A \cap B \cap C) \cup (A \cap B \cap C^c) = A \cap B$, and the sets in this union
are disjoint because $C$ and $C^c$ are disjoint. Hence by additivity we have
$P(A \cap B) = P(A \cap B \cap C) + P(A \cap B \cap C^c)$.

Thus we have shown that $P(A \cap B) = P(A)P(B)$.


(ii) By definition, $P(A \cap B \cap C) = P(A)P(B)P(C)$. By part (i), $P(A)P(B) = P(A \cap B)$.
Hence $P((A \cap B) \cap C) = P(A \cap B)P(C)$, as claimed.


Solution (Exercise 7.7). From the statement of the problem $P(A) = 1/2$,
$P(B) = 1/2$, and A, B are independent.

Since A, B are independent, $P(A \cap B) = P(A)P(B) = 1/4$.

Using the lemma on independence of complements, $P(A \cap B^c) = P(A)P(B^c) = 1/4$,
$P(A^c \cap B) = P(A^c)P(B) = 1/4$, and $P(A^c \cap B^c) = P(A^c)P(B^c) = 1/4$.

Since $C = (A \cap B) \cup (A^c \cap B^c)$, and the two sets in this union are disjoint,
$P(C) = 1/4 + 1/4 = 1/2$.


(i) Then $P(A \cap C) = P(A \cap B) = 1/4 = P(A)P(C)$ and
$P(B \cap C) = P(A \cap B) = 1/4 = P(B)P(C)$, so
A, C and B, C are independent.

(ii) However, $A \cap B \cap C = A \cap B$, so $P(A \cap B \cap C) = 1/4$, while of course
$P(A)P(B)P(C) = 1/8$.
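Since the two-toss sample space has only four points, the whole example can be verified by enumeration. The sketch below represents events as Python sets; the names `omega` and `prob` are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin; all four points equally likely.
omega = set(product("HT", repeat=2))


def prob(event):
    """Probability of an event given as a set of sample points."""
    return Fraction(len(event), len(omega))


A = {w for w in omega if w[0] == "H"}      # first toss is a head
B = {w for w in omega if w[1] == "H"}      # second toss is a head
C = {w for w in omega if w[0] == w[1]}     # the two tosses agree

# Check the product rule for each pair, then for all three events together.
pairwise = all(prob(E & F) == prob(E) * prob(F) for E, F in ((A, B), (A, C), (B, C)))
triple_product_holds = prob(A & B & C) == prob(A) * prob(B) * prob(C)
print(pairwise, triple_product_holds)  # True False
```

The triple product fails because knowing A and B determines C completely, even though C tells you nothing about A alone or B alone.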


Solution (Exercise 7.8). Let n = 1,000,000.
When the coin has success probability p, and $p \neq 1/2$, sample points are
not equally likely. So we need to use independence.

The event $D_1$ consists of points of the form $(1, x_2, \ldots, x_n)$.
That is,
$$D_1 = \{(1, x_2, \ldots, x_n) : x_i = 0 \text{ or } 1 \text{ for } i = 2, \ldots, n\}. \tag{7.14}$$
Thus
$$P(D_1) = \sum_{x_2, \ldots, x_n} \theta(1)\theta(x_2) \cdots \theta(x_n) = p \sum_{x_2, \ldots, x_n} \theta(x_2) \cdots \theta(x_n),$$

where the sum is over all possible values of $x_2, \ldots, x_n$, and each $x_i$ can be
either 0 or 1, for $i = 2, \ldots, n$.


By using the distributive law as much as possible, we see that

$$\underbrace{(\theta(1) + \theta(0)) \cdots (\theta(1) + \theta(0))}_{n-1 \text{ factors}} = \sum_{x_2, \ldots, x_n} \theta(x_2) \cdots \theta(x_n).$$

Hence

$$\sum_{x_2, \ldots, x_n} \theta(x_2) \cdots \theta(x_n) = \underbrace{1 \times \cdots \times 1}_{n-1 \text{ factors}} = 1,$$

and so $P(D_1) = p$, as it must.



Counting


8.1 Counting ordered and unordered choices 


Basic combinatorial methods, i.e. counting “permutations and combinations”,
are essential in analyzing many probability problems.


8.1.1 Ordered choices 


Imagine that you are choosing a small administrative board to serve a club 
with n members. Three positions must be filled: president, vice-president 
and treasurer. No person can fill more than one position. The members of 
the board must be members of the club. 

Think of choosing the president first. This can be done in n ways. Next
choose the vice-president. Since one member of the group is already assigned
to a position, you have n − 1 choices for the vice-president. Finally, choose
the treasurer from the remaining n − 2 people. The total number of ways to
do this is then n(n − 1)(n − 2).

Is it obvious that we should count the total number of ways by multiplying 
the number of choices at each step? Sometimes people picture a “tree of 
possibilities” to see this. 

We can also compare this calculation with another one. 

Suppose that a company owns 5 apartment buildings, and each building
has 3 floors, and each floor has 7 apartments. If you are asked to choose
one of the company’s apartments to live in, you have a total of $5 \times 3 \times 7$
choices. Notice that in this situation, if you make a different choice at step
one, you have a completely different set of choices for step two, and so on.
Perhaps that makes it slightly easier to picture what is going on. In the
case of choosing a board, if you make a different choice at step one, the set
of possible choices for step two is only altered slightly. However, because
you have already made a different choice at step one, there is no danger of
counting the same board twice.

Of course, in any calculation like this, the fact that we can simply multiply 
the number of choices at each step depends on the fact that the number of 
possible choices at each step does not depend on what choices were made in 
previous steps. 

The number of sequences of k distinct elements chosen from a set of
n elements is often denoted by $P_k^n$. In case it is needed in a formula, we
interpret $P_0^n$ as 1, which means that we think there is only one way to choose
nothing.

The argument just given for choosing a board tells us that

$$P_k^n = n(n-1) \cdots (n-k+1) = \frac{n!}{(n-k)!}. \tag{8.1}$$

For k = n we use the standard convention that 0! = 1 in this formula.

As a matter of terminology, a sequence of k distinct elements chosen from
a set S is sometimes called a permutation of length k chosen from S.
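Equation (8.1) can be compared with a direct listing of all possible boards. The sketch below uses a hypothetical club of six members (the names are invented) and the standard-library function `itertools.permutations`, which generates sequences of distinct elements.

```python
from itertools import permutations
from math import factorial


def ordered_choices(n, k):
    """Number of sequences of k distinct elements from n, as in equation (8.1)."""
    return factorial(n) // factorial(n - k)


# Hypothetical club of 6 members; a board is (president, vice-president, treasurer).
club = ["Ann", "Bo", "Cy", "Di", "Ed", "Flo"]
boards = list(permutations(club, 3))

print(len(boards), ordered_choices(len(club), 3), 6 * 5 * 4)  # 120 120 120
```

The function also reproduces the conventions mentioned above: `ordered_choices(n, 0)` is 1, and `ordered_choices(n, n)` is n!.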


8.1.2 Unordered choices 


Now imagine that you are choosing a “clean-up” committee consisting of 3
members, from the club with n members. There are no special roles for the
members of this committee. They are simply supposed to work together
to clean up after the next club meeting. You can still choose the members
one at a time, but choosing the same three people in a different order just
gives you the same committee.

If we think about choosing the committee members one by one, and count 
the number of ways to do that, we know that there are n(n — 1)(n — 2) ways. 
But counting these ordered choices means counting the same committee mul- 
tiple times. How many times? 

A given clean-up committee is a set of 3 people. Equation (8.1), with
k = n = 3, tells us there are 3! ways to perform the ordered choices which
give us the same committee.


So the actual number of distinct possible clean-up committees is n(n — 
1)(n — 2)/3!. 


Generalizing the situation just described, let $C_k^n$ denote the number of
subsets of size k from a set of n members. A brief way to refer to $C_k^n$ is “n
choose k”. Sometimes people refer to a chosen subset as a “combination”.
Then $C_k^n$ is “the number of combinations of n things taken k at a time”.


Lemma 8.1. $C_k^n$ is given by

$$C_k^n = \frac{P_k^n}{k!} = \frac{n(n-1) \cdots (n-k+1)}{k!} = \frac{n!}{k!(n-k)!} = \binom{n}{k}, \tag{8.2}$$

$$\binom{n}{k} = \binom{n}{n-k}. \tag{8.3}$$


By the definition, $C_0^n = 1$. This is consistent with equation (8.1) with the
standard convention that 0! = 1.


Proof. We could imitate the argument just given when k = 3. But perhaps
it’s neater to rearrange the argument, as follows.

Consider choosing a sequence of k distinct elements from a set S containing
n elements. When we thought about this choice, we chose the members
in order, one at a time. But we can also carry out the choice in two stages.

In stage 1, select an unordered subset A of size k. By the definition of
$C_k^n$, that can be done in $C_k^n$ ways. We don’t know the actual numerical value
for $C_k^n$ yet, but by definition $C_k^n$ is the number of ways to choose A.

In stage 2, arrange the elements of A in order. By equation (8.1), with
n = k, this can be done in k! ways.

Clearly $P_k^n$ is found by multiplying the number of ways to perform stage 1
times the number of ways to perform stage 2. Hence $P_k^n = C_k^n \, k!$, proving
equation (8.2).
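The two-stage argument in the proof can be imitated directly: choose a subset, then arrange it. The following sketch (function names are illustrative) checks on small cases that the two-stage procedure produces exactly the same sequences as choosing in order, and that the counts satisfy the relation "sequences = subsets times k!".

```python
from itertools import combinations, permutations
from math import factorial


def count_sequences(n, k):
    """Count sequences of k distinct elements of {0, ..., n-1} by listing them."""
    return sum(1 for _ in permutations(range(n), k))


def count_subsets(n, k):
    """Count subsets of size k of {0, ..., n-1} by listing them."""
    return sum(1 for _ in combinations(range(n), k))


# Stage 1: pick a subset. Stage 2: arrange its elements in every possible order.
two_stage = {order for subset in combinations(range(5), 3) for order in permutations(subset)}
direct = set(permutations(range(5), 3))
print(two_stage == direct)  # True

# The counting relation behind Lemma 8.1, checked for all small n and k.
for n in range(1, 7):
    for k in range(0, n + 1):
        assert count_sequences(n, k) == count_subsets(n, k) * factorial(k)
```

Listing is only feasible for small n, but it makes the bookkeeping of the proof concrete: no sequence is produced twice and none is missed.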


We define $\binom{n}{k} = 0$ if $k < 0$ or $k > n$. This makes equation (8.2) true for
those values of k.

The reason $\binom{n}{k}$ is called the binomial coefficient will be clear from
equation (8.5) below (the binomial theorem).


Lemma 8.2 (Counting successes using zeros and ones). Let S be the
set of all sequences $(x_1, \ldots, x_n)$, where each $x_i$ is zero or one.

Let $A_k$ be the subset of S consisting of all sequences $(x_1, \ldots, x_n)$ such
that $x_i = 1$ for exactly k indices $i$. Then

$$|A_k| = \binom{n}{k}. \tag{8.4}$$


Proof. We can specify any sequence $(x_1, \ldots, x_n)$ by simply specifying the set
of indices $i$ for which $x_i = 1$. Hence the number of sequences $(x_1, \ldots, x_n)$
which have k successes is exactly equal to the number of ways to choose a
subset of size k from a set of size n. That is, $|A_k| = \binom{n}{k}$.


Lemma 8.2 is what we need to finish deriving equation (7.9). That equation
gives the formula for the Binomial Distribution (Theorem 7.5).


8.2 The binomial theorem 


We have used the binomial theorem from time to time in examples. Let’s
give a general statement and proof of this theorem now, for comparison with
the proof that was just given for Theorem 7.5.

Consider expanding $(a + b)^n$. The usual first step is to write

$$(a+b)^n = \underbrace{(a+b)(a+b) \cdots (a+b)}_{n \text{ times}}.$$

The next step is to apply the distributive law energetically, resulting in $2^n$
terms. Notice that each of your $2^n$ terms is a product of n factors. The
factors are a’s and b’s, where we choose either a or b from each of the n
factors in the original expression for $(a + b)^n$.

To record a term, we could simply note the set of factors (a + b) from
which we chose a. That completely specifies the term.

For example, one of the terms in the expansion of

$$(a+b)(a+b)(a+b)(a+b)(a+b)$$

is ababb. We can record that term by saying that we chose a from the first
and third factors and chose b from the other factors.

The final step is called “collecting like terms”. Suppose you would like
to combine all terms which are equal to $a^k b^{n-k}$. How many such terms are
there?

The number of such terms is exactly the same as the number of ways in
which you can select k factors from the n factors in the product $(a + b)^n$.
Hence there are $\binom{n}{k}$ terms which are equal (after rearranging the order) to
$a^k b^{n-k}$.


This proves the binomial theorem:

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}. \tag{8.5}$$
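The expansion step and the collection of like terms can be mimicked by listing the $2^n$ terms as strings of a's and b's. In this sketch the function name and the sample values a = 2, b = 3, n = 5 are arbitrary illustrations; `math.comb` is the standard-library binomial coefficient (Python 3.8 or later).

```python
from collections import Counter
from itertools import product
from math import comb


def like_term_counts(n):
    """Map k to the number of expanded terms that contain exactly k factors a."""
    counts = Counter()
    for term in product("ab", repeat=n):   # one letter chosen from each factor (a + b)
        counts[term.count("a")] += 1
    return counts


n = 5
counts = like_term_counts(n)
print([counts[k] for k in range(n + 1)])   # [1, 5, 10, 10, 5, 1]

# Numerical check of equation (8.5) for one choice of a and b.
a, b = 2, 3
expanded = sum(counts[k] * a**k * b**(n - k) for k in range(n + 1))
print(expanded, (a + b) ** n)              # 3125 3125
```

The term ababb from the text corresponds to the tuple `('a', 'b', 'a', 'b', 'b')`, which is counted under k = 2.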


8.3 Two recursion formulas


To practice using counting arguments, we’ll prove two recursive formulas for 
the binomial coefficients. 


Here’s the first one.

$$\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}. \tag{8.6}$$

To prove (8.6), take any set of n elements, and choose one particular
element for a special role.

When choosing a subset consisting of k elements, there are two possibilities.
Either your subset contains the special element, or it does not.

If your subset does not contain the special element, then it is chosen from
the other n − 1 non-special elements. That can be done in $\binom{n-1}{k}$ ways.

If your subset does contain the special element, then your subset is characterized
by the k − 1 non-special elements it contains. Those elements can be
chosen in $\binom{n-1}{k-1}$ ways.


Combining the two cases proves (8.6).


Exercise 8.1. Check equation using algebra.


Remark 8.3 (Pascal’s triangle). This is a pictorial device based on equation (8.6).
It is used to quickly find small binomial coefficients. One often
writes coefficients in rows, with $\binom{n}{k}$, $k = 0, 1, \ldots, n$, in the n-th row. Elements
to left or right of the binomial coefficients are taken to be zero, and the rows
are staggered, meaning that each element in row n is placed in between the
two nearest elements above it in row n − 1. That equation tells us that each
element in row n is the sum of the two nearest elements in the preceding row.
Thus:


         0 1 0
        0 1 1 0
       0 1 2 1 0
      0 1 3 3 1 0


and so on.
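The rule "each element is the sum of the two nearest elements in the preceding row" translates into a short loop. This is a sketch with an invented function name; the zeros padded onto each row play the role of the zeros written at the ends of the rows in the remark.

```python
from math import comb


def pascal_rows(n_max):
    """Rows 0..n_max of Pascal's triangle, built only from the recursion (8.6)."""
    rows = [[1]]
    for _ in range(n_max):
        padded = [0] + rows[-1] + [0]          # zeros to the left and right
        rows.append([padded[j] + padded[j + 1] for j in range(len(padded) - 1)])
    return rows


rows = pascal_rows(6)
width = len(" ".join(map(str, rows[-1])))
for row in rows:
    print(" ".join(map(str, row)).center(width))   # staggered layout

# Each entry agrees with the binomial coefficient computed from factorials.
assert all(rows[n][k] == comb(n, k) for n in range(7) for k in range(n + 1))
```

No factorials or divisions are needed to build the rows, which is what makes the triangle convenient for small coefficients.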


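Here’s another recursion formula. For any $n \geq 1$ and any $k \geq 1$,

$$k \binom{n}{k} = n \binom{n-1}{k-1}. \tag{8.7}$$

This formula has an easy algebraic proof.

$$\binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n \cdot (n-1)!}{k(k-1)!(n-k)!} = \frac{n}{k} \cdot \frac{(n-1)!}{(k-1)!((n-1)-(k-1))!} = \frac{n}{k} \binom{n-1}{k-1}.$$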


Just for fun, let’s make up a counting proof too. Think about choosing
a special clean-up committee made up of k members from a club with n
members. The members of the committee all must clean, but one member
of the committee is chosen to be the boss of the committee. That member
of the committee has a special role: the boss is responsible for making sure
that the job is done well.

We can choose the special clean-up committee in two stages:


First stage: Choose a set of k members. That can be done in $\binom{n}{k}$ ways.

Second stage: Select one member from the k people in the chosen set, and
make that person the boss. This can be done in k ways.


Hence the total number of possible special committees is

$$k \binom{n}{k}.$$

Alternatively, we can choose the special clean-up committee in a different
way.


Alternate first stage: Select one person from the n people in the club. The
selected person will be the boss. This can be done in n ways.


Alternate second stage: Select the remaining k − 1 members of the clean-up
committee from the remaining n − 1 members of the club. This can be
done in $\binom{n-1}{k-1}$ ways.


Hence the total number of possible special committees is

$$n \binom{n-1}{k-1}.$$

Equating the two different expressions for the number of possible special
committees gives equation (8.7).
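Both ways of building a committee with a boss can be carried out literally and compared. In this sketch a special committee is represented as a pair (set of members, boss); that representation, the function names, and the sample sizes n = 6, k = 3 are illustrative choices.

```python
from itertools import combinations
from math import comb


def committees_then_boss(n, k):
    """First choose k members, then choose the boss among them."""
    return {(frozenset(c), boss) for c in combinations(range(n), k) for boss in c}


def boss_then_rest(n, k):
    """First choose the boss, then the other k - 1 members from the remaining n - 1."""
    result = set()
    for boss in range(n):
        others = [m for m in range(n) if m != boss]
        for rest in combinations(others, k - 1):
            result.add((frozenset(rest) | {boss}, boss))
    return result


n, k = 6, 3
first = committees_then_boss(n, k)
second = boss_then_rest(n, k)
print(first == second, len(first), k * comb(n, k), n * comb(n - 1, k - 1))
# True 60 60 60
```

The two procedures produce the same collection of special committees, which is the content of the counting proof.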


The general theory of counting is known as combinatorics. It would be 
enjoyable to explore this interesting area, but we need to restrain ourselves 
and return to probability. 


8.4 Random sets 


8.4.1 Choosing a subset 


Let S be a set containing a total of N elements. Consider the experiment of
randomly choosing a subset consisting of n elements, in such a way that no
element is favored. It seems most natural to represent a sample point by the
actual subset of elements that are chosen. Let $\Omega$ be the collection of subsets
of size n.

Let x be a particular point in the set S. The probability that x will be
one of the n random elements chosen is n/N, as was shown in Theorem 2.23.
Here we consider probabilities of choosing several particular points.


Exercise 8.2. (i) For the experiment in this section, find the number of
sample points in $\Omega$.


(ii) Suppose you have a special interest in two of the elements in S, called x
and y. Let A be the event that both x and y are in the selected set. Assume
that n > 1. Find P(A).


(iii) Generalize your result in part (ii) to the situation where you are
interested in a particular set T of elements, with $|T| = K$. Let $A_T$ be the
event that all the elements in T are in the selected set. Assume that $n \geq K$.


Example 8.4. In Section 4.6 we considered the situation of Exercise 4.2 and
found two probabilities. Let’s find the same probabilities using our counting
tools.

In Exercise 4.2 we are choosing a set of two jelly beans from a bowl which
contains 75 yellow beans, 53 red beans, 27 purple beans, and 18 green beans.

Thus there are a total of 173 jelly beans in the bowl.

Let R be the event that a set of two red jelly beans is obtained, and let
M be the event that a set containing one red and one green jelly bean is
obtained. We wish to find P(R) and P(M).

Let $\Omega$ be the collection of all two-point subsets of the set of jelly beans
in the bowl. We assume that all sample points are equally likely. Clearly

$$|\Omega| = \binom{173}{2} = \frac{173 \cdot 172}{2},$$

so for each $\omega \in \Omega$ we have

$$P(\{\omega\}) = \frac{1}{|\Omega|} = \frac{2}{173 \cdot 172}.$$

The event R is the collection of all subsets made up of two red jelly beans.
Thus

$$|R| = \binom{53}{2} = \frac{53 \cdot 52}{2}.$$

Hence

$$P(R) = \frac{53 \cdot 52}{2} \cdot \frac{2}{173 \cdot 172} = \frac{53 \cdot 52}{173 \cdot 172}.$$

The event M is the collection of all subsets made up of one red jelly bean
and one green jelly bean. There are 53 ways to choose the red bean and 18
ways to choose the green. Thus

$$|M| = 53 \cdot 18.$$

Hence

$$P(M) = 53 \cdot 18 \cdot \frac{2}{173 \cdot 172} = \frac{2 \cdot 53 \cdot 18}{173 \cdot 172}.$$
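The arithmetic of this example is easy to reproduce with exact fractions. The sketch below uses only the bean counts stated in the example; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Bean counts as stated in the example.
beans = {"yellow": 75, "red": 53, "purple": 27, "green": 18}
total = sum(beans.values())                 # 173 beans in the bowl

n_pairs = comb(total, 2)                    # number of two-bean subsets
p_R = Fraction(comb(beans["red"], 2), n_pairs)            # two red beans
p_M = Fraction(beans["red"] * beans["green"], n_pairs)    # one red, one green

print(total, n_pairs)
print(p_R, float(p_R))
print(p_M, float(p_M))
```

`Fraction` reduces each ratio to lowest terms, so the printed fractions may look different from the unreduced expressions in the text while being equal to them.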


Exercise 8.3. Consider the situation of Exercise part (ii). In addition 
to x and y, suppose you are also interested in a third point z. Let B be the 
event that y and z are both in the selected set. Find P(B | A). 


When choosing sets, one has to be careful in labelling the sizes correctly, 
and counting. It’s not hard, just takes care. Let’s do a little practicing with 
that, next. 


Exercise 8.4. A bowl contains N marbles. K of the marbles are red, the
rest are green. A subset consisting of n marbles is selected. No marble is
favored. Let $R_i$ be the event that there are exactly $i$ red marbles in the
selected set.

From the description of the problem,

$$0 \leq K \leq N, \quad 0 \leq n \leq N. \tag{8.8}$$

(i) Suppose that all the following inequalities hold:

$$0 \leq i \leq K, \tag{8.9}$$
$$i \leq n, \tag{8.10}$$
$$K - i \leq N - n. \tag{8.11}$$

Find $P(R_i)$.

Figure 8.1: Lemma 8.5. $|S| = N$, $|T| = K$, $|C| = n$, $|C \cap T| = i$,
$|C - (C \cap T)| = n - i$.


(ii) Show that equations (8.9), (8.10), and (8.11) must hold for any possible
outcome. Thus $R_i$ must be empty, i.e. $R_i = \emptyset$, when $i$ does not
satisfy all the inequalities in those equations.


(iii) For what value of $i$ is your answer to part (i) already given by
Exercise 8.2?


Abstractly, the preceding exercise deals with a set S of size N, having a specified
subset T of size K. In this situation another subset C, having size n, is
chosen.

For the moment, don’t think about how C is chosen. Let’s just consider
a simple question: if all we know is that C is a subset of S with size n, what
are the possible values for the size of $C \cap T$?

We’ll repeat the arguments for that exercise, starting by writing down
inequalities. Clearly $K \leq N$ and $n \leq N$.

Let $i$ denote the size of $C \cap T$. Then we must have $i \geq 0$, $i \leq K$, $i \leq n$.
Must $i$ satisfy any other inequalities?

Well, note that $C = (C \cap T) \cup (C - (C \cap T))$. So $|C - (C \cap T)| = n - i$,
and the elements of $C - (C \cap T)$ are in $S - T$. So we also have $n - i \leq N - K$,
or equivalently $n + K \leq N + i$, which is equivalent to $K - i \leq N - n$.

Notice that in the solution for that exercise we argue slightly differently
to obtain the same inequality, as follows. $T = (C \cap T) \cup (T - (C \cap T))$. So
$|T - (C \cap T)| = K - i$, and the elements of $T - (C \cap T)$ are in $S - C$. So we
also have $K - i \leq N - n$.


Incidentally, we might rewrite equations (8.9), (8.10), and (8.11) more
symmetrically:

$$\begin{aligned} 0 &\leq i, \\ i &\leq K, \\ i &\leq n, \\ K + n &\leq N + i. \end{aligned} \tag{8.12}$$

These inequalities are symmetric in K and n. They
had to be, because we have not used any information here which treats T
and C differently.

We have established that the size $i$ of $C \cap T$ must satisfy equation (8.12).
As shown in the preceding exercise, no more conditions are needed. The following
lemma asserts all this for the record. The remarks already given can easily
be turned into a more formal proof.


Lemma 8.5 (Possible intersections of two subsets). Let S be a set
with $|S| = N$, and let T be a subset with $|T| = K$. Let n be an integer with
$0 \leq n \leq N$.

For any integer $i$ satisfying equation (8.12), there exists a subset C with
$|C| = n$, such that $|C \cap T| = i$.

Conversely, if C is a subset of S with size n, then $i = |C \cap T|$ satisfies
equation (8.12). See Figure 8.1.


Definition 8.6 (The hypergeometric distribution). In Exercise 8.4 it
was shown that

$$P(R_i) = \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}} \tag{8.13}$$

when $i$ satisfies the inequalities in equation (8.12) (or equivalently when $i$
satisfies equations (8.9), (8.10), and (8.11)).

For any distribution, if equation (8.13) holds when $i$ satisfies the inequalities
in equation (8.12), with $P(R_i) = 0$ otherwise, we say that the distribution
is the hypergeometric distribution, with parameters N, K, n.

Since notations in other books will differ, for applications remember that:


• N is the total size of the population from which a random sample of
size n is selected,


• K is the size of a set of special elements in the population, and


• $i$ is the number of special elements that are in the sample.
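The definition can be turned into a small function. This is a sketch rather than a library routine: the name `hypergeom_pmf` and the argument order are choices made here, and the sample parameters N = 10, K = 4, n = 5 are arbitrary. The guard at the top returns zero when the inequalities on $i$ fail, as the definition requires.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(i, N, K, n):
    """P(R_i): probability of exactly i special elements in a sample of size n."""
    # Outside the range of possible intersection sizes the probability is zero.
    if i < 0 or i > K or i > n or K - i > N - n:
        return Fraction(0)
    return Fraction(comb(K, i) * comb(N - K, n - i), comb(N, n))


N, K, n = 10, 4, 5
probabilities = [hypergeom_pmf(i, N, K, n) for i in range(n + 1)]
print(probabilities)
print(sum(probabilities))   # 1
```

The probabilities add up to one because every chosen subset contains some number $i$ of special elements, so the events $R_i$ split the sample space into disjoint pieces.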


8.4.2 Choosing a sequence 


Let S be a set containing N elements. Consider the experiment of randomly
choosing a sequence consisting of n distinct elements, in such a way that no element
is favored during the successive choices. It seems most natural to represent
a sample point as the actual sequence of elements that are chosen. Let $\Omega$ be
the collection of sequences of n distinct elements of S.


Exercise 8.5. (i) Find the number of sample points in $\Omega$.


(ii) Suppose you have a special interest in two distinct elements in S, called
x and y. Let A be the event that x is the third point chosen and y is the
seventh point chosen. (Assume that n > 6.) Find P(A).


Exercise 8.6. In the setting of the previous exercise, suppose that, in addition to x
and y, you are also interested in a point z, which is different from x and y.
Let B be the event that y is the seventh point chosen and z is the fifth point
chosen. Find P(B | A). (Assume n > 6.)


Exercise 8.7. Recall the earlier marble experiment. In that experiment, a subset consisting
of n marbles is chosen randomly from a bowl of N marbles, and $R_i$ is the
event that exactly $i$ of the chosen marbles are red. The total number of red
marbles in the bowl is K.

The final goal is to find $P(R_i)$, but one may decide to choose the marbles
in the subset one at a time, afterwards ignoring the order in which the marbles
are chosen. Let’s try that approach.

The description of the experiment shows that the values of $P(R_i)$ follow
the hypergeometric distribution, with parameters N, K, n (Definition 8.6).
So the approach we are trying now must eventually produce the
formula for this distribution which was already given in equation (8.13).

To compare our model with Bernoulli trials (Section 7.3), let a sample
point be $\omega = (\omega_1, \ldots, \omega_n)$, where $\omega_\ell$ is the marble chosen at step $\ell$. The
number of red marbles chosen is then the number of indices $\ell$ such that $\omega_\ell$
is red.

$R_i$ is the set of all outcomes $(\omega_1, \ldots, \omega_n)$ such that exactly $i$ of the marbles
$\omega_\ell$ are red.

Much as when we studied coin tossing, let $W_\ell$ be the event that the $\ell$-th
marble chosen is red, so that $W_\ell^c$ is the event that the $\ell$-th marble chosen is
not red. Then for any $\omega = (\omega_1, \ldots, \omega_n)$,

$$\{\omega\} \subset D_1 \cap \cdots \cap D_n, \tag{8.14}$$

where $D_\ell = W_\ell$ if $\omega_\ell$ is red and $D_\ell = W_\ell^c$ if $\omega_\ell$ is not red.

Thus $R_i$ is the union of all events of the form $D_1 \cap \cdots \cap D_n$, where for
each $\ell$, either $D_\ell = W_\ell$ or $D_\ell = W_\ell^c$, and where $D_\ell = W_\ell$ for exactly $i$ of
the times $\ell$.

The next step in this approach would be to calculate $P(D_1 \cap \cdots \cap D_n)$.
But at this point there is an obstacle, since we don’t have independence.

Explain why the events $D_1, \ldots, D_n$ are not independent.


The next exercise shows that the sequence model considered in
Exercise 8.7 eventually leads to equation (8.13), as it should. There are more
steps this way, but the steps are not hard.


Exercise 8.8. In the setting of Exercise 8.7, consider the events $D_1 \cap \cdots \cap D_n$
which make up $R_i$. Show that every event $D_1 \cap \cdots \cap D_n$ has the same
probability, and find this probability.

A good way to think about this is to use the sample space of Exercise 8.7.

Thus a sample point is $\omega = (\omega_1, \ldots, \omega_n)$, where $\omega_\ell$ is the marble chosen
at step $\ell$. The number of red marbles chosen is then the number of indices $\ell$
such that $\omega_\ell$ is red.

Since no marble is favored, all sequences $(\omega_1, \ldots, \omega_n)$ have the same
probability.


Example 8.7 (Plotting a hypergeometric distribution). In the marble
setting of the earlier exercises, suppose we have a bowl with 120 marbles, 40 of which are
red and the rest green. If you randomly select a single marble from this bowl,
the probability of a red marble is 1/3 (by Theorem 2.23, say).

In this setting, consider the experiment of choosing a subset of marbles, with the total
number of marbles equal to 120, the total number of red marbles equal to
40, and 30 marbles chosen from the bowl.

In our present notation N = 120, K = 40 and n = 30.

We are interested in $P(R_i)$, where $R_i$ is the event that $i$ red marbles are
obtained.

This experiment differs from n Bernoulli trials, since the successive colors
obtained in the n selections are not independent, and the values of $P(R_i)$
follow the hypergeometric distribution with parameters N, K, n. We will use
the formula for this distribution derived earlier, equation (8.13).

Panel (b) of the figure shows the graph of $P(R_i)$ versus $i$ for one experiment (N =
120, K = 40, n = 30). We see that this graph is similar to the coin-tossing
graph in panel (a), but the graphs are not identical.

For comparison, panel (c) shows the graph of $P(R_i)$ versus $i$ when
N = 12000, K = 4000, n = 30. We see that this graph is almost identical to
the graph in panel (a). Why is that?

[Figure: three panels, each plotting a probability against a count running from 0 to 30.

(a) Figure 7.1 again. P(k heads), 30 tosses, success prob 1/3.

(b) Probability that a randomly selected subset of size 30 from a set of
120 contains i red marbles, when 40 out of the 120 are red.

(c) Probability that a randomly selected subset of size 30 from a set of
12000 contains i red marbles, when 4000 out of the 12000 are red.]


Exercise 8.9. Suggest an answer to the question posed at the very end of
the preceding example.
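The visual comparison in the example can be supplemented by numbers. The sketch below computes, for each of the two bowls, the largest gap between the hypergeometric probabilities and the probabilities for 30 coin tosses with success probability 1/3. The function names are invented here, and floating-point arithmetic is used, so the printed gaps are approximate.

```python
from math import comb


def hypergeom_pmf(i, N, K, n):
    """Probability of i red marbles in a subset of size n (formula (8.13))."""
    return comb(K, i) * comb(N - K, n - i) / comb(N, n)


def binom_pmf(i, n, p):
    """Probability of i successes in n independent trials with success probability p."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def max_gap(N, K, n):
    """Largest pointwise difference between the two distributions."""
    p = K / N
    return max(abs(hypergeom_pmf(i, N, K, n) - binom_pmf(i, n, p)) for i in range(n + 1))


print(max_gap(120, 40, 30))       # small bowl
print(max_gap(12000, 4000, 30))   # large bowl: a much smaller gap
```

The numbers only confirm what the graphs show; explaining why the gap shrinks is the point of the exercise.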


8.5 Solutions for Chapter 
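Solution (Exercise 8.1).

$$\binom{n-1}{k} + \binom{n-1}{k-1} = \frac{(n-1)!}{k!(n-1-k)!} + \frac{(n-1)!}{(k-1)!(n-k)!}.$$

Bringing the summands to a common denominator gives

$$\frac{(n-k)(n-1)! + k(n-1)!}{k!(n-k)!} = \frac{n!}{k!(n-k)!} = \binom{n}{k}.$$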

Solution (Exercise 8.2). (i) By definition, a sample point is a subset of
S having size n. Hence there are exactly $\binom{N}{n}$ sample points.


(ii) Since no element is favored in the selection, every sample point must
have the same probability. Hence each sample point has probability $1/\binom{N}{n}$.
By additivity,

$$P(A) = \sum_{\omega \in A} P(\{\omega\}) = \frac{|A|}{\binom{N}{n}}.$$

A sample point in A is a subset of S containing x and y and n − 2 additional
elements from S. |A| is equal to the number of ways to choose n − 2 elements
from $S - \{x, y\}$. Hence

$$P(A) = \frac{\binom{N-2}{n-2}}{\binom{N}{n}}. \tag{8.15}$$


Solution (Exercise 8.3). Using, say, Exercise 8.2 part (iii),

$$P(A) = \frac{\binom{N-2}{n-2}}{\binom{N}{n}}, \qquad P(A \cap B) = \frac{\binom{N-3}{n-3}}{\binom{N}{n}}.$$

Hence

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{\binom{N-3}{n-3}}{\binom{N-2}{n-2}} = \frac{\dfrac{(N-3)!}{(n-3)!\,((N-3)-(n-3))!}}{\dfrac{(N-2)!}{(n-2)!\,((N-2)-(n-2))!}} = \frac{(N-3)!\,(n-2)!}{(N-2)!\,(n-3)!} = \frac{n-2}{N-2}.$$

We have followed the general pattern of the conditional probability formula
here. However, we would arrive at the same value for P(B | A) if we
considered the selection of x and y as part of the setting of the experiment,
so that the experiment consists of choosing the rest of the sample. Now the
sample space $\Omega'$ is the set of all subsets of size n − 2 of $S - \{x, y\}$, and we want the
probability that when choosing n − 2 points from $S - \{x, y\}$, the point z is in the chosen
set. By Theorem 2.23, this probability is (n − 2)/(N − 2).

In this way we don’t need to use the conditional probability formula. The
second method is a common approach to conditional probability problems.

Solution (Exercise 8.4). (i) A sample point $\omega$ is a subset of size n.
Suppose that $\omega \in R_i$.
Since $i \leq K$ and $i \leq n$, there are $\binom{K}{i}$ choices for the red marbles in $\omega$.
Since $n - i \leq N - K$, there are $\binom{N-K}{n-i}$ choices for the non-red marbles in $\omega$.
Hence there are $\binom{K}{i}\binom{N-K}{n-i}$ choices for $\omega$ in $R_i$, that is, $|R_i| = \binom{K}{i}\binom{N-K}{n-i}$.
As usual, $|\Omega| = \binom{N}{n}$. Hence

$$P(R_i) = \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}. \tag{8.16}$$


(ii) Equation (8.9) says that every chosen red marble is a red marble.

Equation (8.10) says that every chosen red marble is a chosen marble.

Equation (8.11) says that every remaining red marble is a remaining
marble.

So all three of these equations must hold.

(iii) $R_K$ is simply the event that all the red marbles are chosen. If we think
of the red marbles as the elements of interest in the set, then part (iii) of
Exercise 8.2 tells us that

$$P(R_K) = \frac{\binom{N-K}{n-K}}{\binom{N}{n}}.$$


Solution (Exercise 8.5). (i) A sample point is a sequence of distinct
elements, having length n, so there are exactly $P_n^N$ sample points, where $P_n^N$
is given by equation (8.1),

$$P_n^N = N(N-1) \cdots (N-n+1) = \frac{N!}{(N-n)!}.$$

(ii) Since no element of S is favored in selecting the sequence, all sample
points have the same probability. Thus $P(\{\omega\}) = 1/P_n^N$ for all $\omega$.

For any sequence $\omega$ in the event A, we are given the positions of x and
y in the sequence. Thus $\omega$ is determined once the other elements in the
sequence are determined. The other n − 2 elements in the sequence form a
sequence consisting of distinct elements from $S - \{x, y\}$. Since $|S - \{x, y\}| = N - 2$,
there are $P_{n-2}^{N-2}$ choices for the other elements in the sequence. This
shows that $|A| = P_{n-2}^{N-2}$.

$$P(A) = \sum_{\omega \in A} P(\{\omega\}) = \frac{|A|}{P_n^N} = \frac{P_{n-2}^{N-2}}{P_n^N} = \frac{(N-2) \cdots ((N-2)-(n-2)+1)}{N \cdots (N-n+1)} = \frac{1}{N(N-1)}. \tag{8.17}$$

The positions of x and y in the sequence were given, but we see that the
probability is the same for any fixed choices of the positions.


Solution (Exercise 8.6). 

\[
P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P^{N-3}_{n-3}/P^N_n}{P^{N-2}_{n-2}/P^N_n} = \frac{P^{N-3}_{n-3}}{P^{N-2}_{n-2}}
= \frac{(N-3)\cdots((N-3)-(n-3)+1)}{(N-2)\cdots((N-2)-(n-2)+1)} = \frac{1}{N-2}.
\]

As in the solution for Exercise 8.3, we won't need the conditional probability 
formula if we take the setting of our experiment to include the fact that $x$ is 
the third point chosen and $y$ is the seventh point chosen. 


Solution (Exercise 8.7). To see what is going on, take $n = 2$. 
Suppose $D_1 = W_1$ and $D_2 = W_2$. Then $P(D_1) = K/N$ and $P(D_2) = K/N$. 
However, we can find $P(D_2 \mid D_1)$ by thinking of the second choice as a 
self-contained experiment, with $N - 1$ marbles, $K - 1$ of which are red. Hence 

\[ P(D_2 \mid D_1) = \frac{K-1}{N-1} \ne P(D_2). \]

Thus independence does not hold. 
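
The dependence can also be checked by brute force. The following is a minimal sketch, not part of the original solution: it assumes a hypothetical bowl with N = 10 marbles, of which the K = 4 marbles numbered 0 to 3 are red, enumerates all ordered pairs of distinct marbles, and computes the three probabilities exactly. The function name and the numbers are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations


def red_draw_probabilities(N, K):
    """Exact probabilities for two draws without replacement.

    Marbles are numbered 0..N-1 and marbles 0..K-1 are red (an assumption
    made only for this illustration).
    """
    outcomes = list(permutations(range(N), 2))  # equally likely ordered pairs
    total = len(outcomes)
    first_red = sum(1 for a, b in outcomes if a < K)
    second_red = sum(1 for a, b in outcomes if b < K)
    both_red = sum(1 for a, b in outcomes if a < K and b < K)
    p_first = Fraction(first_red, total)
    p_second = Fraction(second_red, total)
    p_second_given_first = Fraction(both_red, first_red)
    return p_first, p_second, p_second_given_first


p_first, p_second, p_second_given_first = red_draw_probabilities(N=10, K=4)
print(p_first, p_second, p_second_given_first)
```

For these numbers the two unconditional probabilities are both K/N, while the conditional probability is (K - 1)/(N - 1), so the two draws are not independent.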


Solution (Exercise 8.8). We'll use the sample space of Exercise 8.5. 
Thus a sample point is $\omega = (\omega_1, \ldots, \omega_n)$, where $\omega_\ell$ is the marble chosen 
at step $\ell$. The number of red marbles chosen is then the number of indices $\ell$ 
such that $\omega_\ell$ is red. 
By equation (8.1), 

\[ |\Omega| = \frac{N!}{(N-n)!}. \]

Since all marbles are treated the same way, all sample points have the same 
probability. For any $(\omega_1, \ldots, \omega_n)$, 

\[ P(\{(\omega_1, \ldots, \omega_n)\}) = \frac{1}{|\Omega|} = \frac{(N-n)!}{N!}. \tag{8.18} \]


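Let $K$ be the total number of red marbles in the bowl. 

Fix $i$, $1 \le i \le n$ and $i \le K$. Let $D_1 \cap \ldots \cap D_n$ be such that 
$D_1 \cap \ldots \cap D_n \subset R_i$. 

Consider $(\omega_1, \ldots, \omega_n) \in D_1 \cap \ldots \cap D_n$. Then $\omega_\ell$ is red for exactly $i$ 
of the indices $\ell$. Let $V$ be the set of indices $\ell$ such that $\omega_\ell$ is red. Let 
$W = \{1, \ldots, n\} - V$ be the other indices. 

There are $i$ indices in $V$. So, using equation (8.1), the number of ways to 
choose $\omega_\ell$ for $\ell \in V$ is 

\[ \frac{K!}{(K-i)!}. \]

That is the number of ways to fill the red indices. The number of non-red 
indices is $n - i$, and the number of non-red marbles is $N - K$. Hence the 
number of ways to fill the non-red indices is 

\[ \frac{(N-K)!}{((N-K)-(n-i))!}. \]

Hence 

\[ |D_1 \cap \ldots \cap D_n| = \frac{K!}{(K-i)!} \cdot \frac{(N-K)!}{((N-K)-(n-i))!}. \]

Using equation (8.18), 

\[ P(D_1 \cap \ldots \cap D_n) = |D_1 \cap \ldots \cap D_n| \, \frac{(N-n)!}{N!}. \]

Thus 

\[ P(D_1 \cap \ldots \cap D_n) = \frac{K!\,(N-K)!}{(K-i)!\,((N-K)-(n-i))!} \cdot \frac{(N-n)!}{N!}. \tag{8.19} \]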


Equation (8.19) shows in particular that $P(D_1 \cap \ldots \cap D_n)$ has the same 
value for any set $D_1 \cap \ldots \cap D_n$ contained in $R_i$. 

To check the probability value given by equation (8.19), note that a particular 
event $D_1 \cap \ldots \cap D_n$ is determined once the set $V$ of red indices is 
chosen. Hence there are $\binom{n}{i}$ events $D_1 \cap \ldots \cap D_n$ contained in $R_i$, and so 

\[
P(R_i) = \binom{n}{i} \frac{K!\,(N-K)!}{N!} \cdot \frac{(N-n)!}{(K-i)!\,((N-K)-(n-i))!}
= \frac{n!}{i!\,(n-i)!} \cdot \frac{K!\,(N-K)!}{N!} \cdot \frac{(N-n)!}{(K-i)!\,((N-K)-(n-i))!}. \tag{8.20}
\]

That is, 

\[
P(R_i) = \frac{K!}{i!\,(K-i)!} \cdot \frac{(N-K)!}{(n-i)!\,((N-K)-(n-i))!} \cdot \frac{n!\,(N-n)!}{N!}.
\]

This agrees with equation (8.13), which says that 

\[ P(R_i) = \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}. \]
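
The agreement between the ordered-sampling count and the binomial-coefficient form can be confirmed numerically. The sketch below is only an illustration under assumed parameters (N = 12, K = 5, n = 4 are hypothetical, as are the function names); it evaluates both expressions with exact rational arithmetic and compares them.

```python
from fractions import Fraction
from math import comb, factorial


def p_ri_from_sequences(N, K, n, i):
    """P(R_i) via ordered sampling: number of red/non-red patterns
    times the probability of one pattern, as in equations (8.19)-(8.20)."""
    per_pattern = Fraction(
        factorial(K) * factorial(N - K) * factorial(N - n),
        factorial(K - i) * factorial((N - K) - (n - i)) * factorial(N),
    )
    return comb(n, i) * per_pattern


def p_ri_from_subsets(N, K, n, i):
    """P(R_i) via unordered sampling (the binomial-coefficient form)."""
    return Fraction(comb(K, i) * comb(N - K, n - i), comb(N, n))


# Hypothetical parameters, chosen so that every i in 0..n is a possible value.
N, K, n = 12, 5, 4
values = [p_ri_from_sequences(N, K, n, i) for i in range(n + 1)]
for i, value in enumerate(values):
    assert value == p_ri_from_subsets(N, K, n, i)
    print(i, value)
print("total:", sum(values))
```

Using `Fraction` avoids rounding, so equality of the two formulas is tested exactly rather than approximately.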




Solution (Exercise 8.9). When the total number of marbles is large, and 
the total number of red marbles is large, choosing one marble has little effect, 
meaning that the chance of a red marble on a second choice is almost the 
same as it was on the first choice, regardless of whether or not the first marble 
was red. Thus the outcomes (red or non-red) are almost independent, and 
choosing 30 marbles is almost like 30 independent trials in coin-tossing. 

Incidentally, that graph in Figure 8.2b looks a bit narrower than the graph 
in Figure doesn’t it? Could there be a reason for that? 
A reason is given at the end of Example 


Chapter 9 


Random variables 


In this chapter we introduce a new concept, random variables. The benefits 
of using this concept will become evident in the chapters that follow. The 
current chapter has important definitions and examples, but not much in the 
way of applications and clever calculations. Readers should try to enjoy the 
peaceful contemplation of well-chosen concepts. This investment will pay off 
later. 


9.1 Random variables defined 


Definition 9.1 (Random variables). Physically, a random variable for 
an experiment is a quantity whose value depends on the outcome of the 
experiment. In our discussions the quantity will usually be a number, but it 
might be a vector, a set, a symbol, or some other property. 

Mathematically, a random variable is a function whose domain is the 
sample space $\Omega$ of a model, and whose values can be of any kind. A real-valued 
random variable is a function from a sample space $\Omega$ to the real 
numbers. 

To save words, we often simply use the phrase “random variable” to mean 
a real-valued random variable, since that is the most common case for us. 
Since we are following that convention, when we are dealing with a random 
variable whose values are not numbers, we’ll try to say what the values are, 
or at least add an adjective to make that clear. For example we might speak 
of a “vector-valued random variable”, or a “random vector”, to indicate that 
our random variable has vector values. 

By convention, random variables are normally denoted by uppercase letters, 
with $X, Y, Z$ being the most common choices. 


We'll give some examples of random variables soon, after a brief discussion 
of the definition. 

Definition 9.1 is not quite complete, since it omits a technicality. This 
technicality has no practical significance for our work in this book, and can 
be safely ignored. A brief discussion of what we are ignoring is given below 
in Section 9.7. 


Notation for properties and sets of values 

Suppose that X is a real-valued random variable, and that for some reason 
we want to consider the probability that X is greater than five and less than 
eight. 

The usual notation for this probability is $P(5 < X < 8)$. That expresses 
the probability using “property language”. 

We can also write the same probability as $P(X \in (5, 8))$. That expresses 
the probability using “set language”. 

The same kind of notation is used in general. If $S$ is the set of all possible 
values that have the property that we are interested in, we can write $P(X \in S)$ 
to denote the probability that the value of $X$ lies in the set $S$. 

We could use a more formal mathematical notation for $P(X \in S)$. If $X$ 
is a function on a sample space $\Omega$, we could define an event $A$ by 

\[ A = \{\omega : X(\omega) \in S\}. \tag{9.1} \]

Then $P(A)$ would be $P(X \in S)$. 

But usually it is more convenient to use the briefer notations which are 
common in probability. So we just write $P(X \in S)$ instead of defining $A$. 
The notations used in probability theory are the following: 

\[ \{X \in S\} = \{\omega : X(\omega) \in S\}, \]

\[ P(X \in S) = P(\{X \in S\}) = P(\{\omega : X(\omega) \in S\}). \tag{9.2} \]


Readers will find that this type of notation is quite clear and easy to read. 


Notation for random variables. Since a random variable is a function, 
why not use a typical function name, such as “f”, instead of an uppercase 
letter? Perhaps an uppercase letter is used to remind the reader that the 
domain of a random variable is the set of possible outcomes for an experiment. 
This can be very different from the domain of a function in calculus, which 
is usually an interval of the real line. 

Calling a random variable “X” offers another benefit. If we want to refer 
to a value of the random variable X, we can denote the value by the lowercase 
letter “x”. This reminds the reader of the source of the value. 

In this chapter we will mainly consider a random variable whose set of 
possible values is finite, i.e. a random variable with finite range. But most 
of the concepts make sense for general mathematical random variables. 


Example 9.2 (The result of one coin toss). For a single toss of a coin, 
let X = 1 if the result is a head, and let X = 0 if the result is a tail. In other 
words, X is equal to the number of heads obtained by this toss. 

If the probability of a head is p, then the probability that X = 1 is equal 
to $p$, and the probability that $X = 0$ is equal to $1 - p$. 

X is a very boring random variable! However, we will soon see that more 
interesting random variables can be built using random variables like X. 

To represent $X$ mathematically, if the sample space $\Omega$ is equal to the 
two-point set $\{1, 0\}$, as in Example then $X(\omega) = \omega$, but, as usual in 
applications, there is no need to use any particular sample space. Given a 
physical random variable $X$, the mathematical random variable representing 
$X$ is valid if it has the correct values, and produces those values with the 
correct probabilities. 


Remark 9.3 (An example of an alternate sample space for one coin 
toss). To emphasize the fact that the sample space is not unique, here’s an 
extreme example of an alternate sample space. We could take $\Omega$ equal to the 
whole unit interval $[0, 1]$, and use the uniform probability $P$ on $[0, 1]$. In this 
case, define a random variable $X$ by $X(\omega) = 1$ if $0 \le \omega < p$, and $X(\omega) = 0$ 
if $p \le \omega \le 1$. (Think about randomly choosing a point in the unit interval 
(with a uniform probability distribution) and shouting “success!” or “Pay 
me!” if the chosen point lies in the interval $[0, p)$.) 

Notice that with this definition we have arranged matters so that the 
possible values of X are 1 and 0, the probability that X = 1 is equal to p, 
and the probability that $X = 0$ is equal to $1 - p$. This exactly matches the 
physically observable behavior of the random variable $X$ which was defined 
on the two-point set {0,1}. 

Remember, what matters are the values, and the probabilities of those 
values. That’s what is “real” about the mathematical random variable. 

One might certainly say that using $\Omega = [0, 1]$ is wasteful, since we don’t 
need such a big sample space, but the sample space is not incorrect. It might 
even be appropriate in a complicated experiment, if there are additional 
properties that must be represented using the same sample space. 
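
The point that only the values and their probabilities matter can be illustrated by simulation. The sketch below assumes the interval sample space of Remark 9.3, with p = 0.3 chosen arbitrarily; the function names and the use of a seeded generator are illustrative choices, not part of the text.

```python
import random


def X(omega, p):
    """The random variable of Remark 9.3: 1 on [0, p), 0 on [p, 1]."""
    return 1 if omega < p else 0


def estimate_success_frequency(p, trials, seed=0):
    """Fraction of simulated sample points omega with X(omega) = 1."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(X(rng.random(), p) for _ in range(trials))
    return hits / trials


frequency = estimate_success_frequency(p=0.3, trials=100_000)
print(frequency)  # should be close to p
```

An observer who sees only the simulated values of X cannot tell whether they came from this sample space or from the two-point sample space with the same p.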


Exercise 9.1 (Notation check). Let $X$ be the random variable defined in 
Remark 9.3. Let $S = \{0\}$. Find $\{X \in S\}$. 


Example 9.4 (The result of one roll of a die). For a single roll of a die, 
let X be the number that shows on the die when it comes to rest. Then the 
possible values for $X$ are $1, 2, 3, 4, 5, 6$. 

If the die is fair, then the probability that $X = i$ is equal to $1/6$ for all 
$i$. In general, the probability that $X = i$ will be some probability $p_i$, where 
$p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1$. 

If the sample space $\Omega$ is equal to the six-point set $\{1, 2, 3, 4, 5, 6\}$, then 
$X(\omega) = \omega$, but, as Remark 9.3 illustrates, we can always use some 
other sample space. 


The language of random variables gives us a new way to describe some 
events, but we still find probabilities using the same rules. The next exercise 
illustrates this. 


Exercise 9.2. Let $X$ be the random variable defined in Example 9.4. 


• Let $A = \{X > 4\}$, and let $B$ be the event that $X$ is an even number. 
Find $A$ and $B$, as subsets of $\{1, 2, 3, 4, 5, 6\}$. 


• Find the probability that $X$ is an even number. 


Example 9.5 (Number of successes in Bernoulli trials (coin tosses)). 
Let $A_1, \ldots, A_n$ be Bernoulli trials (see Section ) with success probability 
$p$. That is, $A_1, \ldots, A_n$ are independent and $P(A_i) = p$ for each $i$. 

As our most typical example, the experiment consists of tossing a coin $n$ 
times, and $A_i$ might be the event that toss $i$ gives a head. 

Let $S_n$ be the total number of successes. (This notation is not related to 
our earlier use of $S$ to denote some set.) By definition, $S_n$ is the number of 
indices $i$ such that $A_i$ occurs. For the experiment of tossing a coin $n$ times, 
$S_n$ is the number of heads which are obtained. 

The possible values for $S_n$ are $0, 1, \ldots, n$, so $S_n$ has a fairly simple range. 

The event $\{S_n = k\}$ is exactly the event $G_k$ described in Theorem 
Thus equation (7.10) states that 

\[ P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}. \tag{9.3} \]
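
Equation (9.3) can be checked against a direct count for small n. The following is a sketch under stated assumptions: outcomes are coded as 0/1 sequences, independence is used to give each sequence with k ones the probability p^k (1 - p)^(n - k), and the values p = 1/3, n = 5 and n = 30 are illustrative choices. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(n, p, k):
    """Right side of equation (9.3)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pmf_by_enumeration(n, p, k):
    """Add up the probabilities of all 0/1 sequences with exactly k ones."""
    total = Fraction(0)
    for outcome in product((0, 1), repeat=n):
        if sum(outcome) == k:
            total += p**k * (1 - p) ** (n - k)
    return total


p = Fraction(1, 3)
n = 5
for k in range(n + 1):
    assert binomial_pmf(n, p, k) == pmf_by_enumeration(n, p, k)

# For larger n the formula is the practical route; enumeration is too slow.
pmf_30 = [binomial_pmf(30, p, k) for k in range(31)]
print(sum(pmf_30))  # the probabilities of all possible values add up to one
```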


9.2 The probability of obtaining a value in a set 


Random variables are implicitly present in any probability model. Using the 
terminology of random variables explicitly is often convenient, even when we 
are performing the same old calculations. 

Readers might want to look one more time at the calculation in the second 
solution of Exercise 9.2. The calculation is trivial, of course, but it feels 
liberating to simply write 

\[ P(X \text{ is even}) = P(X = 2) + P(X = 4) + P(X = 6) = \frac16 + \frac16 + \frac16 = \frac12, \]


without giving a thought to the sample space. 
It’s useful to state a general version of the same argument. 

Let $X$ be any random variable, and let $W$ be any set such that $W$ only 
contains a finite number of points in the range of $X$. Let $x_1, \ldots, x_k$ be any 
list of distinct values which includes all the numbers in the range of $X$ that 
are members of $W$. If the value of $X$ is a member of $W$, then the value of $X$ 
must be equal to one of the numbers $x_1, \ldots, x_k$. Thus 

\[ \{X \in W\} = \{X = x_1\} \cup \ldots \cup \{X = x_k\}. \tag{9.4} \]

Since the values $x_1, \ldots, x_k$ are distinct, the sets $\{X = x_1\}, \ldots, \{X = x_k\}$ are 
disjoint. By the additivity of probability we have 

\[ P(X \in W) = P(X = x_1) + \ldots + P(X = x_k). \tag{9.5} \]


We typically use facts like equation (9.5) without comment. Equations of 
this sort help us to think about events in terms of what they mean, rather 
than as subsets of the abstract sample space. 
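
In code, this way of computing P(X ∈ W) needs only the probability of each value, never a sample space. The sketch below assumes the fair die of Example 9.4, stored as a dictionary from values to probabilities; the helper name `prob_in_set` is hypothetical.

```python
from fractions import Fraction


def prob_in_set(pmf, W):
    """P(X in W): sum of P(X = x) over the values x of X that lie in W."""
    return sum((q for x, q in pmf.items() if x in W), Fraction(0))


fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
evens = {2, 4, 6, 8, 10}  # W need not be a subset of the range of X
print(prob_in_set(fair_die, evens))
```

Values in W that X never takes (here 8 and 10) simply contribute nothing to the sum.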

The next exercise extends equation (9.5). All readers should note 
the statements, and think about them enough to see that they are true. 


Exercise 9.3 (Cases for a random variable). Let X be any random 
variable. 


(i) For any sets $W_1, \ldots, W_k$, which need not be subsets of the range of $X$, 
if $W = W_1 \cup \ldots \cup W_k$, show that 

\[ \{X \in W\} = \{X \in W_1\} \cup \ldots \cup \{X \in W_k\}. \tag{9.6} \]

Some of the sets $\{X \in W_i\}$ may be empty, but that’s fine. 


(ii) Suppose now that the sets $W_1, \ldots, W_k$ are disjoint. Show that the sets 
$\{X \in W_1\}, \ldots, \{X \in W_k\}$ are disjoint. Let $W = W_1 \cup \ldots \cup W_k$. Show that 

\[ P(X \in W) = P(X \in W_1) + \ldots + P(X \in W_k). \tag{9.7} \]

(By taking $W_i = \{x_i\}$, we see that equation (9.7) includes equation (9.5) as 
a special case.) 


9.3 Estimating probability sums 


For the random variable $S_n$ in Example 9.5, equations (9.3) and (9.5) can be 
used to find the probability of any event defined in terms of the value of $S_n$. 
However, when $n$ is large it may take some work to extract the information 
we need. 

For example, toss a fair coin a million times. Let $n = 1{,}000{,}000$, so $S_n$ is the 
number of heads obtained. We are rarely interested in the tiny probability 
that $S_n$ is exactly equal to 499,500. But we might be interested in, say, 
the probability that at most 49.95% of the tosses resulted in a head, i.e. 
$P(S_n \le 499{,}500)$. How do we find this probability, or at least estimate this 
probability in some way? We know from equation (9.3) and additivity that 
the probability is given by 

\[ P(S_n \le 499{,}500) = \sum_{j=0}^{499{,}500} \binom{1{,}000{,}000}{j} \left(\frac12\right)^{1{,}000{,}000}. \tag{9.8} \]

True, but the size of this number does not exactly leap out at us. Not only 
are there many terms, but any term that makes a significant contribution 
to the sum must be the product of a very large number times a very small 
number. 
The Central Limit Theorem (11) is a powerful method for estimating 
probabilities like $P(S_n \le 499{,}500)$. Incidentally, the Central Limit Theorem 
tells us that $P(S_n \le 499{,}500)$ is approximately equal to .159. 
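
With a computer, the sum in equation (9.8) can also be evaluated directly, as long as each term is computed through logarithms so that the huge binomial coefficient and the tiny power of 1/2 never appear separately. The sketch below is one way to do this, not the book's method. It also evaluates the usual normal approximation with mean n/2 and standard deviation sqrt(n)/2; that particular form of the approximation is an assumption here, since the theorem itself is stated elsewhere in the book. All function names are hypothetical.

```python
from math import erf, exp, lgamma, log, sqrt


def fair_coin_cdf(n, m):
    """P(S_n <= m) for n tosses of a fair coin, summing the terms of (9.8).

    Each term is built as exp(log C(n, j) + n log(1/2)) to avoid overflow.
    Terms too small to represent simply evaluate to 0.0.
    """
    log_half_power = n * log(0.5)
    log_n_factorial = lgamma(n + 1)
    total = 0.0
    for j in range(m + 1):
        log_choose = log_n_factorial - lgamma(j + 1) - lgamma(n - j + 1)
        total += exp(log_choose + log_half_power)
    return total


def standard_normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


n, m = 1_000_000, 499_500
direct_sum = fair_coin_cdf(n, m)
normal_estimate = standard_normal_cdf((m - n / 2) / (sqrt(n) / 2))
print(round(direct_sum, 4), round(normal_estimate, 4))
```

Both numbers round to about .159, in line with the value quoted above. The loop has roughly half a million terms, so it takes a moment to run.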


9.4 Random variable distributions 


Recall that we introduced the general idea of a probability distribution in 
Definition 1.13. Any rule which assigns probabilities for a family of events can 
be called a probability distribution. The next definition is a very important 
example of this terminology. 


Definition 9.6 (The distribution of a general random variable). For 
any real-valued random variable X associated with any probability model, 
the probability distribution of X is the rule that specifies P(X € S) for 
subsets S of R. 

We write X ~ Y to express the fact that X and Y are random variables 
with the same distribution. 


Notice that since a distribution specifies probabilities, the probability 
distribution of a random variable is something that can be measured exper- 
imentally, or at least tested. If X is a random variable associated with a 
repeatable experiment, and if someone asserts that $P(X > 5)$ is equal to 
.3, then we could in principle perform many repetitions of the experiment, 
and measure the fraction of repetitions in which $X > 5$ occurs, to see if this 
frequency is close to .3. So distributions are “real”. 

On the other hand, a sample space is an abstract concept in our minds, 
which is useful but can never be directly measured. If two people are sepa- 
rately creating probability models for the same experiment, they may come 
up with very different sample spaces. But they must agree about the distri- 
butions of any physically meaningful random variables. 


At present we are mainly dealing with finite-range random variables. 
Equation (9.5) shows that it is easy to find the distribution of a finite-range 
random variable once we know the probability of each point in the range. 
Sometimes it’s convenient to use the probability mass function notation 
(introduced in Definition 2.12). 


Definition 9.7 (The probability mass function for the distribution 
of a random variable). Let $X$ be a real-valued random variable for a probability 
model. The probability mass function for the distribution of $X$ is the 
function $q$ on $\mathbb{R}$ defined by $q(x) = P(X = x)$. 

Clearly $q(x) = 0$ for any $x$ which is not in the range of $X$, so $q$ is determined 
by its values as a function on the range of $X$. 


Let $X$ be a finite-range random variable whose distribution has probability 
mass function $q$. Let $W$ be a subset of $\mathbb{R}$, and let $x_1, \ldots, x_k$ be any list of 
distinct numbers which includes all the numbers in the range of $X$ that are 
members of $W$. We can rewrite equation (9.5) using $q$: 

\[ P(X \in W) = q(x_1) + \ldots + q(x_k). \tag{9.9} \]

When dealing with a finite-range random variable $X$, one sometimes refers 
to the probability mass function of $X$ briefly as the distribution of $X$. This 
is natural, since it determines the distribution. 


Example 9.8 (A random variable with a binomial distribution). For 
the random variable $S_n$ defined in Example 9.5, using equation (9.3) we have 

\[ q(k) = P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}. \tag{9.10} \]

The distribution of $S_n$ is exactly the binomial distribution defined in Definition 
We say that $S_n$ has a binomial distribution, and may also refer to 
$S_n$ as a binomial random variable. 


Example 9.9 (A random variable with a hypergeometric distribution). 
Consider a set of $N$ objects, $K$ of which are in a certain target class. 
Let a set of $n$ objects be randomly selected from the $N$ objects (sampling 
without replacement). We assume that the inequalities in equation 
hold, i.e. $K \le N$ and $n \le N$. 

Let $L_{N,K,n}$ be the number of target objects in the selected set. $P(L_{N,K,n} = i)$ 
is the value $P(R_i)$ studied in Exercise 8.4 and Definition 8.6. 

Thus the distribution of $L_{N,K,n}$ is the hypergeometric distribution, with 
parameters $N, K, n$, which was defined in Definition 8.6. 

By definition, the range of $L_{N,K,n}$ is the set of all $i$ such that 
equation (8.12) holds. 
By equation (8.16), 

\[ q(i) = P(L_{N,K,n} = i) = \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}} \tag{9.11} \]

for $i$ such that equation (8.12) holds. Otherwise $P(L_{N,K,n} = i) = 0$. 


Can we graph a random variable? Our main experience with functions 
has likely been in the setting of calculus, and in calculus we certainly can 
understand a function better by plotting its graph. It can be difficult to 
graph a random variable, though, since the domain of the random variable 
might be very different from the real line. Example 9.5 illustrates this, since 
for this example the domain is the sample space, and that might be the set 
of all sequences of zeros and ones that have length n. There seems to be no 
convenient way to portray the set of such sequences visually, at least when 
n is greater than 2 or 3. 

What we can do is to graph the probabilities for the values of the random 
variable, i.e. the probability mass function. 

For the random variable X of Example Figure shows a graph of 
P(X =k) versus k for n = 30 and p = 1/3. 

In Example 8.7 we considered a slightly different random variable. Here 
$L_{120,30,40}$ is the number of red marbles in a set of 30 marbles randomly selected 
from a bowl containing 120 marbles, when 40 of the marbles in the 
bowl are red, so $L_{120,30,40}$ has a hypergeometric distribution. The graph of 
$P(L_{120,30,40} = i)$ versus $i$ was given in Figure 8.2b. 


9.5 Expressing the distribution of X using a 
probability density 


If X is a random variable which does not have a finite range, it may not be 
obvious how to describe its distribution. But in many cases the distribution 
can be described using a probability density. In this chapter we’ll just give the 
basic idea, concentrating on uniform distributions. Example 9.14 in the next 
section will put densities to work, and show how they simplify computations 
and clarify our thinking. 

Probability densities were defined using equation of Section 3.4. Readers 
may wish to review that definition, as well as Remark 3.6. 


Definition 9.10 (Density of the distribution of a real-valued random 
variable). The distribution of a real-valued random variable X is described 
by a probability density h if the probability that the value of X lies in a set 
S is given by the integral of h over S, for subsets S of the real line. 

In other words, 

\[ P(X \in S) = \int_S h \tag{9.12} \]

for subsets $S$ of the real line. 

(To be more precise mathematically, equation (9.12) holds for every set 
$S$ that we would ever use to describe an event. See Section 9.7 for another 
comment on this.) 


In equation (9.12), the integral of $h$ over the set $S$ is written as $\int_S h$. 
This is the modern notation for integration over a set, as in equation (3.10) 
of Section 

Although the general concept of integration over a set is not difficult (see 
Definition 3.5), we are more familiar with the special case of integrating over 
intervals, using calculus notation. When $S$ is the interval $[a, b]$, 

\[ \int_S h = \int_{[a,b]} h = \int_a^b h(x)\,dx. \tag{9.13} \]


Remark 9.11 (Intervals are enough). In Remark 3.6 it is asserted that if 
an equation like (9.12) is valid when $S$ is an interval, then we are guaranteed 
that it holds for all subsets $S$ of the real line. So if you want to check that 
some function $h$ is the correct density for the distribution of $X$, it’s enough 
to check that 

\[ P(X \in J) = \int_J h \tag{9.14} \]

for all intervals $J = [a, b]$ of the real line. Most of the time we will work with 
intervals. 


Example 9.12 (Extending a density to the whole line). Consider the 
experiment of choosing a point from the interval $[s, t]$, with no point favored. 
If $J$ is a subinterval of $[s, t]$, and $A$ is the event that the chosen point lies 
in $J$, then equation (3.4) of Section 3.3 says that 

\[ P(A) = \frac{\operatorname{length}(J)}{\operatorname{length}([s, t])}. \]

Let $f$ be the constant function which is equal to $1/(t - s)$ everywhere on 
$[s, t]$. 

As in Exercise it’s easy to check that 

\[ \frac{\operatorname{length}(J)}{\operatorname{length}([s, t])} = \int_J f, \]

so 

\[ P(A) = \int_J f(x)\,dx. \tag{9.15} \]

Now let’s express everything in the language of random variables. Let $X$ 
be the chosen point. $X$ is a random variable for this experiment. Since $A$ is 
the event $\{X \in J\}$, equation (9.15) says that 

\[ P(X \in J) = \int_J f \tag{9.16} \]

for every interval $J$ which is contained in $[s, t]$. 

Does that mean that f is a density for the distribution of X? Well, not 
quite. For f to be the density, f must be defined on all of the real line. And 
equation (9.16) must hold for all intervals, not just intervals that are subsets 
of $[s, t]$. 

Of course, the description of the experiment says that the random point 
is never selected from the complement of $[s, t]$. So there is zero probability 
that the value of $X$ lies outside the interval $[s, t]$, i.e. $P(X \in [s, t]^c) = 0$. 

If we just extend the density $f$ to the whole real line, by setting $f(x) = 0$ 
for all $x$ in the complement of $[s, t]$, that should be a valid density for the 
distribution of $X$. So let’s do that. Define the function $h$ by 

\[ h(x) = \begin{cases} \dfrac{1}{t-s} & \text{if } x \in [s, t], \\ 0 & \text{otherwise.} \end{cases} \tag{9.17} \]


See Figure 9.1. It seems very plausible that $h$ is a valid density for the 
distribution of $X$, but let’s check that. We must check that 

\[ P(X \in J) = \int_J h \tag{9.18} \]

for all intervals $J$. 

Since $h = f$ on $[s, t]$, equation (9.18) holds for every interval $J$ which is 
contained in $[s, t]$. That’s how we got started. 

What about other intervals $J$? For any interval $J$ which is contained in 
the complement of $[s, t]$, the description of the experiment says that $P(X \in J) = 0$. 
And since $h = 0$ everywhere on the complement of $[s, t]$, both sides of 
equation (9.18) are zero, so that case works too. 


Figure 9.1: $h$ is a density on $\mathbb{R}$ for the distribution of $X$, when choosing a 
point in $[2, 7]$. 


We’ve shown that the density h works when J is contained in the interval 
[s,t], and it also works when J is completely outside the interval [s, t]. That 
makes us confident that h is indeed a correct density for the distribution of 
X, and readers will probably be satisfied with that. 

However, if you feel like more checking, you can consider the general case, 
when the interval J might have a piece which is inside [s,t] and a piece that 
is outside $[s, t]$. Exercise 9.5 asks you to do that. 
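
A numerical experiment can make the density $h$ of equation (9.17) more concrete. The sketch below assumes the interval $[2, 7]$ of Figure 9.1 and compares two computations for an interval $[a, b]$: the length of its overlap with $[s, t]$ divided by $t - s$, and a midpoint Riemann sum of $h$ over $[a, b]$. The function names and test intervals are illustrative assumptions, and a numerical check is of course not a proof.

```python
def uniform_density(x, s, t):
    """The density h of equation (9.17): 1/(t - s) on [s, t], 0 elsewhere."""
    return 1.0 / (t - s) if s <= x <= t else 0.0


def prob_in_interval(a, b, s, t):
    """Length of the part of [a, b] inside [s, t], divided by t - s."""
    overlap = max(0.0, min(b, t) - max(a, s))
    return overlap / (t - s)


def riemann_integral(a, b, s, t, steps=10_000):
    """Midpoint Riemann sum of h over [a, b]."""
    dx = (b - a) / steps
    return sum(uniform_density(a + (j + 0.5) * dx, s, t) for j in range(steps)) * dx


s, t = 2, 7
for a, b in [(3, 5), (8, 9), (0, 3)]:  # inside, outside, and straddling [s, t]
    print((a, b), prob_in_interval(a, b, s, t), round(riemann_integral(a, b, s, t), 3))
```

In each of the three cases the two numbers agree up to the accuracy of the Riemann sum.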


Remark 9.13 (Terminology for distributions). To save words, in the 
situation of equation (9.17), we may just say that “the distribution of X is 
uniform on [s,t], and is zero everywhere else”. 


Exercise 9.4. Let X be a random variable whose distribution has a density 
h which is equal to a constant on [3, 11] and is equal to zero elsewhere. Find 
$P(1 < X < 5)$. 


Exercise 9.5. Based on the facts already established in Example 9.12, show 
that equation (9.18) holds for a general interval $J$, which might have a part 
inside $[s, t]$ and a part outside. 


9.6 Random variables as a tool for thinking 


When modeling a real-world problem, random variables occur naturally, and 
we naturally analyze the problem in terms of random variables. 


Example 9.14. Recall Exercise 4.12. There we considered an experiment with 
two steps: first a fair coin is tossed. Then, if the result of the toss is a head, 
in step two a point is chosen from $[0, 3]$, with no point favored. If the result 
of the coin toss is a tail, in step two a point is chosen from $[2, 4]$, with no 
point favored. 

Our goal in Exercise 4.12 was to find the probability that the chosen point 
lies in a given subinterval $J$ of $[0, 4]$. 

Let $X$ be the point chosen in step two. Then $X$ is a random variable, 
and we can restate the goal of the problem as: find $P(X \in J)$. This leads 
us to consider trying to obtain a probability density $h$ for the distribution of 
$X$. 

First, let’s think about conditional probabilities. If the coin toss gives a 
head, the second step of the experiment consists of choosing a point from 
$[0, 3]$, with a probability distribution which is uniform on $[0, 3]$. Thus, conditional 
on obtaining a head, we know by equation (9.17) that the distribution 
of $X$ has a density $h_1$ which is given by 

\[ h_1(x) = \begin{cases} \frac13 & \text{if } x \in [0, 3], \\ 0 & \text{otherwise.} \end{cases} \tag{9.19} \]

To say that $h_1$ is a density for the distribution of $X$ conditional on $H$ means 
that for all $S$, 

\[ P(X \in S \mid H) = \int_S h_1. \tag{9.20} \]

Similarly, if the coin toss gives a tail, the second step of the experiment 
consists of choosing a point from $[2, 4]$. Conditional on that, the distribution 
of $X$ has a density $h_2$ which is given by 

\[ h_2(x) = \begin{cases} \frac12 & \text{if } x \in [2, 4], \\ 0 & \text{otherwise.} \end{cases} \tag{9.21} \]

To say that $h_2$ is a density for the distribution of $X$ conditional on $T$ means 
that for all $S$, 

\[ P(X \in S \mid T) = \int_S h_2. \tag{9.22} \]


How can we combine the conditional densities $h_1, h_2$ to find $h$? 

Just as in the solution to the original form of Exercise 4.12, we can use 
the Law of Total Probability (Theorem 4.3). Using the language of random 
variables, this says that for any interval $J$, 

\[ P(X \in J) = P(H)\,P(X \in J \mid H) + P(T)\,P(X \in J \mid T), \tag{9.23} \]

where $H, T$ are the events that the coin toss gives a head or a tail. 

Probability densities are not probabilities, but they are closely related to 
probabilities. Looking at equation (9.23) makes us think that we will get a 
valid density h if we define h by 


\[ h = P(H)\,h_1 + P(T)\,h_2. \tag{9.24} \]

And we do! Just integrate the right side of equation (9.24) over an interval 
$J$, use equations (9.20) and (9.22), with $S = J$, and you will obtain the right 
side of equation (9.23). Thus the integral of $h$ over $J$ gives us $P(X \in J)$. 

A probability density is correct if it gives the correct probabilities when 
you integrate it, so the function $h$ defined by equation (9.24) is a valid density 
for the distribution of $X$. 

For this experiment, $P(H) = P(T) = 1/2$, and equations (9.19) and (9.21) 
give us $h_1$ and $h_2$. Substituting these values into equation (9.24) gives 

\[
h(x) = \begin{cases}
\frac12 \cdot \frac13 = \frac16 & \text{if } x \in [0, 2), \\
\frac12 \cdot \frac13 + \frac12 \cdot \frac12 = \frac{5}{12} & \text{if } x \in [2, 3], \\
\frac12 \cdot \frac12 = \frac14 & \text{if } x \in (3, 4], \\
0 & \text{otherwise.}
\end{cases} \tag{9.25}
\]

A graph of $h$ is shown in Figure 9.2. 


Figure 9.2: $h$ is a density on $\mathbb{R}$ for the distribution of $X$, where $X$ is chosen 
from overlapping intervals. 


Please check that integrating $h$ gives all the information obtained in 
Cases (i), (ii), (iii), (iv) of the solution for Exercise 4.12. Of course the 
density $h$ contains much more information, and we can display the graph of 
$h$. 
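
The mixture density of Example 9.14 is piecewise constant, so its integrals can be computed exactly. The sketch below is an illustration only, with hypothetical function names: it evaluates $h = P(H)h_1 + P(T)h_2$ at a point, and computes $P(X \in [a, b])$ from the Law of Total Probability using interval overlaps, with exact rational arithmetic.

```python
from fractions import Fraction

HALF = Fraction(1, 2)  # P(H) = P(T) = 1/2 for a fair coin


def h(x):
    """The density h = P(H) h_1 + P(T) h_2 from Example 9.14."""
    h1 = Fraction(1, 3) if 0 <= x <= 3 else Fraction(0)
    h2 = Fraction(1, 2) if 2 <= x <= 4 else Fraction(0)
    return HALF * h1 + HALF * h2


def uniform_prob(a, b, s, t):
    """P(point in [a, b]) for a point chosen uniformly from [s, t]."""
    overlap = max(Fraction(0), min(b, t) - max(a, s))
    return Fraction(overlap, t - s)


def mixture_prob(a, b):
    """P(X in [a, b]) by the Law of Total Probability."""
    return HALF * uniform_prob(a, b, 0, 3) + HALF * uniform_prob(a, b, 2, 4)


print(h(1), h(Fraction(5, 2)), h(Fraction(7, 2)))
print(mixture_prob(0, 2), mixture_prob(2, 3), mixture_prob(3, 4), mixture_prob(0, 4))
```

On each of the three pieces the probability equals the constant value of $h$ times the length of the piece, and the three pieces together have probability one.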


In the next chapter, we will define the expected value of a random vari- 
able, which is a fundamental concept for the application of probability. 


9.7 A technical point about sets 


We mentioned in Section 9.1 that Definition 9.1 omitted a technicality. For 
those who are interested, here are some remarks about that. 


Remark 9.15 (Measurable sets). Recall that Definition stated that 
in any probability model, some subsets of a sample space are designated as 
events. This definition did not say that every subset of the sample space is 
an event. And, in fact, a complete description of a mathematical probability 
model includes an extra requirement, something like this: every event must 
be a “measurable set”. 

What does the mathematical term “measurable” mean in this context? 
It has a special meaning here, different from the ordinary meaning of an 
experimental measurement. 

As a typical example, consider the real numbers. Roughly speaking, a 
measurable set of real numbers is any set which has an explicit description. 
The term “explicit description” is used rather generously, since it includes 
any mathematical construction using an infinite sequence of set operations 
on intervals, or an infinite sequence of infinite sequences of set operations, 
and so on, forever. Any set that could conceivably be used in our applications 
of mathematics is a measurable set. 

An optimistic person might conclude from these statements that every 
subset of the real line is measurable, but sadly this is not the case. It can be 
shown mathematically that there must exist subsets of the real line which are 
nonmeasurable. So the best we can say is that any set which is “of interest” 
is measurable. 

One might philosophize that having such bizarre subsets lurking in the 
background is part of the price that we pay for using a powerful abstraction 
like the real numbers. 

In contrast to the real numbers, every set of integers is a measurable set. 
But real numbers are so useful in models that we will not give them up. 

When studying advanced probability, it is necessary at times to check 
that the mathematical theorems of probability can be applied without using 
nonmeasurable sets. Definition 9.1 would then be slightly enlarged, to spell 
out the technical requirements that a random variable must satisfy. However, 
in applications these requirements are always met. 

So in this book, if we are dealing with any real-valued function X on a 
sample space, you may simply take it for granted that X is a valid random 
variable. And if S is any set that we are interested in, then you may take it 
for granted that {X ∈ S} is an event.

To avoid a lot of verbiage, we'll use the following convention. 


Any statement about sets, such as “for all sets S”, or “for any set 
S”, may actually mean “for all measurable sets S”, or “for any 
measurable set S”. And the difference between these statements 
has no practical significance. 


These conventions apply, for example, to equation (9.7) and equation (9.12). 


9.8 Solutions for Chapter 9


Solution (Exercise 9.1). By definition, X(ω) = 1 if 0 < ω < p, and
X(ω) = 0 if p < ω < 1.
Thus


Solution (Exercise 9.2).


• X is the number that shows on the die when it comes to rest. The
range of X is {1, 2, 3, 4, 5, 6}.


Let’s take the sample space Ω to be the six-point set {1, 2, 3, 4, 5, 6}, so
that X(ω) = ω for each ω ∈ Ω.


From the definition of X, A = {5, 6} and B = {2, 4, 6}.


• By additivity,


P(B) = P({2}) + P({4}) + P({6}) = 1/6 + 1/6 + 1/6 = 1/2.


Let’s repeat the same argument again, this time without using an ex- 
plicit sample space. There is really no need to define a sample space, 
as long as we understand the behavior of X. 


Each possible value of X has the same probability, so we can immedi- 
ately say that 


P(X = 1) = P(X = 2) = P(X = 3) = P(X = 4) = P(X = 5) = P(X = 6).


These numbers add to one, so P(X = i) = 1/6 for i = 1, ..., 6.


If the value of X is even, then the value of X is one of the numbers 
2,4,6. Hence 


{X is even} = {X = 2} ∪ {X = 4} ∪ {X = 6}.


By the additivity of probability,


P(X is even) = P(X = 2) + P(X = 4) + P(X = 6) = 1/6 + 1/6 + 1/6 = 1/2.


With practice, reasoning about events in terms of the values of a random
variable will seem very natural.


Solution (Exercise 9.3). (i) We must show that {X ∈ W} is the union
of the sets {X ∈ W_1}, ..., {X ∈ W_k}.

For any sets A and B, one can prove that A = B by showing two facts: 
first, that every member of A is a member of B, and second, that every 
member of B is a member of A. Thus we consider two cases here. 

(1.) Suppose that ω is in the union of the sets {X ∈ W_1}, ..., {X ∈ W_k}.
Then for some i, ω ∈ {X ∈ W_i}, and so X(ω) ∈ W_i, and thus X(ω) ∈ W,
and so ω ∈ {X ∈ W}.

(2.) Suppose that ω ∈ {X ∈ W}. Then X(ω) ∈ W, so X(ω) ∈ W_i
for some i, and so ω ∈ {X ∈ W_i}, and thus ω is in the union of the sets
{X ∈ W_1}, ..., {X ∈ W_k}.

Facts 1. and 2. show that {X ∈ W} is the union of the sets {X ∈ W_1}, ..., {X ∈ W_k},
as claimed.


(ii) Let W_1, ..., W_k be disjoint sets. We claim that the sets {X ∈ W_1}, ..., {X ∈ W_k}
are disjoint.

To see that, suppose that for some sample point ω, we have ω ∈ {X ∈ W_i}
and ω ∈ {X ∈ W_j}. Then X(ω) ∈ W_i and X(ω) ∈ W_j. Since the sets
W_1, ..., W_k are assumed to be disjoint, it must be the case that i = j.

Thus for i ≠ j, {X ∈ W_i} and {X ∈ W_j} have no points in common, i.e.
they are disjoint, as claimed.


Let W be the union of the sets W_1, ..., W_k. By part (i), {X ∈ W} is the
union of the sets {X ∈ W_1}, ..., {X ∈ W_k}, i.e.


{X ∈ W} = {X ∈ W_1} ∪ ... ∪ {X ∈ W_k}.
Using additivity, equation (9.7) holds.


Solution (Exercise 9.4). We are told that h is constant on [3, 11]. Let c
denote the value of h on [3, 11]. Since h is a probability density on R,


\int_{-∞}^{∞} h = 1.

Since h = 0 outside [3, 11],


\int_{-∞}^{∞} h = \int_{3}^{11} h = c(11 − 3) = 8c.


Thus c = 1/8.


Let J = [1, 5]. We’re asked to find P(X ∈ J). Since h is a density for the
distribution of X,

P(X ∈ J) = \int_J h = \int_{1}^{5} h.


Since h = 0 outside [3, 11],


P(X ∈ J) = \int_{3}^{5} h = \int_{3}^{5} \frac{1}{8} = \frac{2}{8} = \frac{1}{4}.

Solution (Exercise 9.5). The description of the experiment says that the
point is chosen from [s, t], so it is never chosen from the complement of [s, t].
Thus, P(X ∈ [a, b]^c) = 0.

For any interval J, since


P(X ∈ J) = P(X ∈ J ∩ [a, b]) + P(X ∈ J ∩ [a, b]^c),


we have


P(X ∈ J) = P(X ∈ J ∩ [a, b]). (9.26)


And we also know that h = 0 everywhere on [s, t]^c. So


\int_J h = \int_{J ∩ [a, b]} h. (9.27)


It is easy to see that the set J ∩ [a, b] is an interval, and of course it is
contained in [a, b]. We’ve already considered that case, so we know that


P(X ∈ J ∩ [a, b]) = \int_{J ∩ [a, b]} h.


Applying equations (9.26) and (9.27), this turns into

P(X ∈ J) = \int_J h,


which is equation (9.18).


Expected values, finite range case


10.1 Expected value defined 


We will define expected value in this chapter for the case of a random variable 
with finite range, and then establish the main properties of expected values. 

When more general random variables are studied, the definition of ex- 
pected value will have to be appropriately extended. However the properties 
of expected value will remain unchanged, for the most part. 


Example 10.1 (Average payoff). Consider tossing an unfair coin. If the 
result is a head, we say you have success. Let the probability of success be 
3/5. 

Suppose you toss the coin 1000 times, and for each success you receive 2 
dollars. For failure you get nothing. What is the average amount of money 
that you would expect to earn per toss? 

The amount you actually receive on any given toss might be called the 
“payoff”. So we are asking for the average payoff. 

Notice that the payoff on any given toss is determined by the outcome of 
the toss. Thus it is a function of the physical result of the toss, and so it is 
a random variable in the physical sense. We want to know the average value 
of this random variable over a sequence of repeated tosses. 

If we choose a mathematical sample space to represent the physical
experiment, then the payoff is represented by a mathematical function whose
domain is the sample space, so it is a random variable in the mathematical
sense. But let’s think physically for a moment.

In order to calculate the average payoff, think about the frequency of suc- 
cesses. If you toss the coin 1000 times, it is likely that your success frequency 
will be approximately 3/5. Since 3/5 x 1000 = 600, approximately 600 of 
the tosses will result in success. Thus you expect to earn approximately 1200 
dollars over the whole sequence of trials, so you expect to earn approximately 
1.20 dollars per toss. This is your “average payoff”.
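
The estimate in this example can also be tried out on a computer. What follows is only a sketch, under stated assumptions: the function name average_payoff, its parameters, and the use of Python's pseudorandom generator as a stand-in for the unfair coin are all choices made here for illustration, not part of the example itself.

```python
import random


def average_payoff(n_tosses, p_success=3 / 5, payoff=2.0, seed=None):
    """Simulate n_tosses tosses and return the average payoff per toss.

    Each toss is a success with probability p_success and then pays
    `payoff` dollars; a failure pays nothing.
    """
    rng = random.Random(seed)          # seeded generator, so runs can be repeated
    total = 0.0
    for _ in range(n_tosses):
        if rng.random() < p_success:   # rng.random() lies in [0, 1)
            total += payoff
    return total / n_tosses


if __name__ == "__main__":
    # The results should be near 1.20, but they are random and will not
    # usually equal 1.20 exactly.
    print(average_payoff(1000, seed=0))
    print(average_payoff(100_000, seed=0))
```

With more tosses, the simulated average tends to lie closer to 1.20, which is the behavior the frequency argument above leads us to expect.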


In Example 10.1 we have just given a theoretical estimate for the average
payoff in repeated tosses, without actually performing any tosses. That is 
certainly simpler than doing repeated experiments! We call this theoretical 
estimate the expected value for the payoff random variable. 

The expected value of a random variable is only a single number. But it 
tells us something about all the possible values of the random variable, taken 
together. This is a new idea. 

The theoretical approach to finding an expected value is not just simpler 
than the experimental alternative. It may also help us to understand the 
experimental situation which is being studied.

Now we will give a precise mathematical definition for expected value, for 
the case of a random variable with finite range. 


Definition 10.2 (Expected value, finite range case). Let X be a random
variable. Suppose that the range of X is equal to {x_1, ..., x_n}, where
x_1, ..., x_n are distinct numbers.

The expected value of X, denoted by E[X], is defined by


E[X] = \sum_{i=1}^{n} x_i P(X = x_i). (10.1)


In other books, E[X] is often written as EX.
The expected value of X is also called the expectation of X or the mean
of X. A random variable with expected value zero is often called a mean
zero random variable.
Occasionally it is helpful to have a notation which explicitly states which
probability is being used to calculate the expected value. The expected value
of X using probability set-function P is then denoted by
E_P[X]. (10.2)


If a different probability set-function Q is used, then the corresponding
expected value is denoted by E_Q[X], and so on.


The expected value of X is defined by the sum in equation (10.1). This
sum is said to be a weighted average (Definition A.2). It is a weighted average
of the possible values of X, in which the weight of each value x_i is the
probability P(X = x_i) with which it occurs. Readers who have not used
weighted averages may find it worthwhile to work through some exercises in
Appendix A.

Example A.3 shows that we can picture the expected value of X as the center
of mass of the distribution, when the distribution is represented as lumps of 
probability mass located at the values of X. 

Since the expected value of X is defined in terms of the values of X and their
probabilities, it is determined by the distribution of X. The distribution of
a random variable is a real and testable physical property, so the expected 
value is a real and testable physical property. Expectation can be calculated 
using any valid sample space representation that you like, but the value must 
be the same for any valid sample space. 
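
Since the expected value depends only on the distribution, it can be computed from a table of values and probabilities. Here is a minimal sketch of that computation. The function name expected_value and the sample distribution are hypothetical, chosen only for illustration; exact fractions are used so that the check "the probabilities add to one" is not disturbed by rounding.

```python
from fractions import Fraction


def expected_value(pmf):
    """Return the sum of x * P(X = x) over the distinct values x.

    `pmf` is a dictionary that maps each distinct value of X to its
    probability, so the keys play the role of x_1, ..., x_n.
    """
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must add to one")
    return sum(x * p for x, p in pmf.items())


# A made-up distribution with three possible values.
pmf = {-1: Fraction(1, 4), 0: Fraction(1, 2), 3: Fraction(1, 4)}
print(expected_value(pmf))   # prints 1/2
```

Because the keys of a dictionary are automatically distinct, this representation matches the requirement in the definition that the listed values be distinct.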


Exercise 10.1 (One-toss payoff). In Example 10.1, consider just tossing
the coin once.

If the sample space for one toss is Ω = {0, 1}, the payoff function Y for
one toss is simply defined by Y(1) = 2, Y(0) = 0.

Use Definition 10.2 to find E[Y].


Exercise 10.2 (Distinct values). In Definition 10.2, why are the numbers
x_1, ..., x_n required to be distinct?


Exercise 10.3 (Order of values is irrelevant). In Definition 10.2, x_1, ..., x_n
is a list of the distinct values in the range of the random variable. Explain
why the order in which we list the values does not matter.


Is it important to check our general formulas, to make sure they are right? 
Well, somebody should certainly check. Expectation is such an important 
concept that it seems worthwhile to check that every property we need is a 
consequence of the definitions. We won’t always take time to do this, but in 
the proof of the next lemma we will. 


Lemma 10.3 (Single event expectation). Let A be an event and let c be
a real number. Let X be the random variable defined by X(ω) = c if ω ∈ A,
X(ω) = 0 otherwise. Then


E[X] = cP(A). (10.3)


Proof. Case 1. If A is the empty set, then X(ω) = 0 for all ω, so the range
of X is {0}.
Then by definition E[X] = 0 · P(X = 0) = 0 = cP(A), so equation (10.3)
holds.
Case 2. If c = 0 then again the range of X is {0}, and equation (10.3) holds
as in Case 1.
Case 3. If A = Ω, then the range of X is {c}, and by definition E[X] =
c · P(X = c) = c · P(Ω), so equation (10.3) holds.
Case 4. The only remaining case is that A ≠ ∅ and A ≠ Ω, with c ≠ 0.
Then the range of X consists of the distinct points 0, c.
Then by definition E[X] = c · P(A) + 0 · P(A^c) = c · P(A). Thus
equation (10.3) holds.


An important consequence of the lemma just proved: for any constant c,
E[c] = c, (10.4)
where E[c] denotes the expected value of the random variable which is equal
to c for all outcomes.


Hmm, we looked at a lot of cases when proving Lemma 10.3. The next
exercise would have saved some work!


Exercise 10.4 (Unused values in the definition are ok). Let X be a
random variable. Let y_1, ..., y_n be distinct numbers, such that every nonzero
number in the range of X is included in the list y_1, ..., y_n. Prove that

E[X] = \sum_{i=1}^{n} y_i P(X = y_i). (10.5)


Exercise 10.5. Use the preceding exercise to give a shorter proof of Lemma 10.3.


Example 10.4 (One coin toss). Let X be as in Example 9.2. Notice that
X is the number of successes obtained in the coin toss (either 0 or 1).

Let A be the event that the toss gives success. Then X(ω) = 1 when
ω ∈ A, and X(ω) = 0 otherwise.

Applying Exercise 10.4, E[X] = 1 · P(X = 1) + 0 · P(X = 0) = p.


Example 10.5 (One roll of a die ). We deal with the result of rolling a die 
(Example similarly to Example Let X be the number obtained 
by rolling the die, so that the range of X is {1,2,3,4, 5,6}. 

By definition, E[X] = 1 · P(X = 1) + 2 · P(X = 2) + 3 · P(X = 3) + 4 · P(X = 4)
+ 5 · P(X = 5) + 6 · P(X = 6), i.e.


E[X] = \sum_{i=1}^{6} i P(X = i).


Lemma 10.6 (Expectation of a scaled random variable). Let X be a
random variable and let c be a number. Then


E[cX] = cE[X]. (10.6)


Proof. The statement is true in general, but we only consider the case of a 
finite-range random variable here. 

If c = 0, then cX is the zero random variable. Using Lemma 10.3 (or the
definition of expected value), we know that E[cX] = 0, so we certainly have
E[cX] = cE[X].

From now on suppose that c ≠ 0.


Let x_1, ..., x_n be the distinct numbers in the range of X. Since c ≠ 0,
we have X = x_i if and only if cX = c x_i. Hence the range of cX is the set
{c x_1, ..., c x_n}, and the numbers c x_1, ..., c x_n are distinct. By the definition
of expected value,

E[cX] = c x_1 P(cX = c x_1) + ... + c x_n P(cX = c x_n).

But P(cX = c x_i) = P(X = x_i) (it’s the same event), so


E[cX] = c x_1 P(X = x_1) + ... + c x_n P(X = x_n) = cE[X].


The property of expectation stated in Lemma 10.6 is very simple, but it’s
important. Just to have a name for this property, we’ll call it the scaling
property. This is not a standard mathematical term, but it fits, since
Lemma 10.6 says that if we scale (up or down) all the values of X by a factor
c, then we scale E[X] by the same factor.
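
The scaling property is easy to check on a small example. The sketch below is an illustration only: the helper names expected_value and scale, and the distribution used, are assumptions made here. The helper scale builds the distribution of cX from that of X, merging values that coincide, which happens only when c = 0.

```python
from fractions import Fraction


def expected_value(pmf):
    """Sum of x * P(X = x) over the distinct values x in the dictionary."""
    return sum(x * p for x, p in pmf.items())


def scale(pmf, c):
    """Return the distribution of cX, given the distribution of X."""
    scaled = {}
    for x, p in pmf.items():
        # If c = 0, every value is sent to 0 and the probabilities add up.
        scaled[c * x] = scaled.get(c * x, 0) + p
    return scaled


# A made-up distribution, used only for this check.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 3), 6: Fraction(1, 6)}

for c in (3, -2, 0):
    print(c, expected_value(scale(pmf, c)), c * expected_value(pmf))
```

In each printed row the last two numbers agree, including the row for c = 0, where the range of cX collapses to the single value 0.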


The next exercise is an example for an upcoming theorem.
However, it is instructive to solve it directly here.


Exercise 10.6 (Finding expected value using cases). Consider the
number wheel described in Exercise 2.13.

Imagine a game in which the wheel is spun. Let Z be the number at
which the wheel stops. Then Z is a random variable, and the possible values
for Z are 0, 1, 2, ..., 100. We assume that each of these numbers occurs with
the same probability.

In this game, a payoff is given, based on the number where the wheel
stops, i.e. based on the value of Z.
Let the payoff be called X.
The rules are as follows:


• If Z = 0, then X = 0.


• If Z = 100, then X = 5.


• If Z is even and 0 < Z < 100, then X = 2.


• If Z is odd, then X = 1.


Thus X = φ(Z), where φ is defined in the obvious way:


φ(0) = 0,
φ(100) = 5,
φ(i) = 2 if i is even and 0 < i < 100,
φ(i) = 1 if i is odd.    (10.7)


(i) Find E[X], using the definition of expected value.


(ii) Show that


E[X] = \sum_{i=0}^{100} φ(i) P(Z = i). (10.8)


Does the statement of this equation feel right to you? Since φ(i) is the
value of X when Z = i, this equation says that E[X] is equal to a weighted
sum of values of X, but it’s not the same sum which is used in the definition
of E[X]. Instead it’s a sum over cases, where each case is given by the value
of Z. The weight of each case is the probability of the case.


10.2 Expected value by cases


Please think about the statement of the theorem until it seems reasonable. 
The proof is optional, but it is not hard to understand the basic step: group- 
ing similar terms in a sum over cases, where each term is a value times a 
probability. 


Theorem 10.7 (Expectation by cases). Let D_1, ..., D_k be disjoint events
in some model, such that D_1 ∪ ... ∪ D_k = Ω.


Let v_1, ..., v_k be numbers, and let X be a random variable such that
X(ω) = v_i at every point ω in D_i.
Then

E[X] = \sum_{i=1}^{k} v_i P(D_i). (10.9)


See Figure 10.1.


Proof. Let x_1, ..., x_m be a list of the distinct numbers in the range of X.

If an event D_i is nonempty, then it contains at least one point ω. By
assumption, X(ω) = v_i. Thus for every nonempty D_i, v_i is a point in the
range of X.

For each i, if D_i happens to be empty, change v_i to one of the values
x_1, ..., x_m. Such changes clearly make no difference to the sum in equation
(10.9), so they don’t affect the truth of the theorem.

Now we can say that the numbers v_1, ..., v_k are a list of the values
x_1, ..., x_m, possibly with repetitions, as Figure 10.1 illustrates.

Since D_1 ∪ ... ∪ D_k = Ω, for every x_j there must be at least one event D_i
such that v_i = x_j.

We can choose the labels for the numbers x_1, ..., x_m so that x_1 < x_2 <
... < x_m.

The order in which we write the events D_i makes no difference in equation
(10.9). For convenience, relabel the events and associated values so that
v_1 ≤ v_2 ≤ ... ≤ v_{k-1} ≤ v_k.

For every j, let i_j be the largest index i such that v_i = x_j. Because of the
ordering of the values, i_m must be k.

The following picture illustrates the general situation.

The top row is simply D_1, ..., D_k, in order, but grouped.

\begin{array}{ccccc}
D_1, \dots, D_{i_1} & & D_{i_1+1}, \dots, D_{i_2} & \cdots & D_{i_{m-1}+1}, \dots, D_{i_m} \\
v_1 = \dots = v_{i_1} & < & v_{i_1+1} = \dots = v_{i_2} & < \cdots < & v_{i_{m-1}+1} = \dots = v_{i_m} \\
x_1 & & x_2 & \cdots & x_m
\end{array}    (10.10)


The events in the figure are grouped in a similar way.
We see that

{X = x_1} = D_1 ∪ ... ∪ D_{i_1},

and for j = 2, ..., m,

{X = x_j} = D_{i_{j-1}+1} ∪ ... ∪ D_{i_j}. (10.11)


Define i_0 = 0. Then equation (10.11) holds for all j = 1, ..., m, which
saves writing.
Since the events D_i are disjoint, for every j we have


P(X = x_j) = P(D_{i_{j-1}+1}) + ... + P(D_{i_j}).


By definition,

E[X] = \sum_{j=1}^{m} x_j P(X = x_j) = \sum_{j=1}^{m} x_j (P(D_{i_{j-1}+1}) + ... + P(D_{i_j}))

     = \sum_{j=1}^{m} (x_j P(D_{i_{j-1}+1}) + ... + x_j P(D_{i_j}))

     = \sum_{j=1}^{m} (v_{i_{j-1}+1} P(D_{i_{j-1}+1}) + ... + v_{i_j} P(D_{i_j})).

The last sum contains the term v_i P(D_i) exactly once for each index
i = 1, ..., k, so it is equal to the sum in equation (10.9).

In the proof of Theorem 10.7, did we really need to relabel the events D_i
and values v_i so that v_1 ≤ v_2 ≤ ... ≤ v_{k-1} ≤ v_k?

We didn’t do that in the figure, did we? And no, we don’t really need
the relabelling step. But if we didn’t do that, we wouldn’t be able to display
the general grouping picture given in equation (10.10).

You can still talk about the grouping, though, by defining sets of indices:
you would define N_j to be the set of indices i such that v_i = x_j. Then, instead
of saying:

P(X = x_j) = P(D_{i_{j-1}+1}) + ... + P(D_{i_j}),


you would say:


P(X = x_j) = \sum_{i ∈ N_j} P(D_i),


and run the same argument.


Figure 10.1: For Theorem 10.7. Here v_1 = x_1, v_2 = v_3 = x_2, and v_4 = v_5 =
x_3, where x_1, x_2, x_3 are distinct. {X = x_1} = D_1, {X = x_2} = D_2 ∪ D_3,
{X = x_3} = D_4 ∪ D_5.


The number wheel exercise above is an example for the next theorem.


Theorem 10.8 (Expectation of a function of a finite-range random
variable). Let Y be a random variable on a sample space Ω. Let the distinct
values in the range of Y be y_1, ..., y_k. In this theorem there is no need to
assume that y_1, ..., y_k are numbers. They can be anything.

Let φ be any real-valued function whose domain includes y_1, ..., y_k. Then


E[φ(Y)] = \sum_{i=1}^{k} φ(y_i) P(Y = y_i). (10.12)


Proof. Let D_i = {Y = y_i}, let v_i = φ(y_i), and apply Theorem 10.7.
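
For readers who like to check a theorem numerically, here is a sketch that computes the expected payoff of the number wheel game in two ways: as a sum over the cases Z = i, and from the definition of expected value, with one term for each distinct value of X. The code is an illustration under stated assumptions: the function name phi and the variable names are chosen here, and the case Z = 0 is tested first, so that the "even" rule applies only to 2, 4, ..., 98.

```python
from fractions import Fraction


def phi(i):
    """Payoff rule of the number wheel game, for i = 0, 1, ..., 100."""
    if i == 0:
        return 0
    if i == 100:
        return 5
    return 2 if i % 2 == 0 else 1


values_of_Z = range(101)     # the possible values 0, 1, ..., 100
p = Fraction(1, 101)         # each value of Z has the same probability

# Sum over cases: one term phi(i) * P(Z = i) for each value i of Z.
by_cases = sum(phi(i) * p for i in values_of_Z)

# Definition of expected value: first collect P(X = x) for each distinct x.
pmf_X = {}
for i in values_of_Z:
    pmf_X[phi(i)] = pmf_X.get(phi(i), 0) + p
by_definition = sum(x * q for x, q in pmf_X.items())

print(by_cases == by_definition)   # prints True
```

The loop that builds pmf_X is exactly the "grouping of similar terms" used in the proof of the theorem on expectation by cases.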


Exercise 10.7. Suppose that the distribution of X is uniform on the points
{−2, −1, 0, 1, 2}. Find E[X^2] in two ways: from the definition and using
Theorem 10.8.


Example 10.9 (Expectations on finite sample spaces). Let X be a
random variable defined on a finite sample space Ω. Let the distinct sample
points be ω_1, ..., ω_n.


In Theorem 10.7, let D_i = {ω_i}. Equation (10.9) gives us a pleasantly
simple formula for expected value:


E[X] = \sum_{i=1}^{n} X(ω_i) P({ω_i}). (10.13)


Of course the values X(ω_i) might not be distinct. But as usual that’s
ok: we only give each outcome its own probability weight, so there is no
“double-counting”.
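
Here is a small sketch of the sum over sample points, compared with the sum over distinct values. The sample space (two tosses of a fair coin) and the random variable (the number of heads) are choices made here purely for illustration, as are all the variable names.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: ('H','H'), ('H','T'), ...
omega = list(product("HT", repeat=2))
prob = {w: Fraction(1, 4) for w in omega}


def X(w):
    """Number of heads in the outcome w."""
    return w.count("H")


# One term X(w) * P({w}) for each sample point w.
by_points = sum(X(w) * prob[w] for w in omega)

# One term x * P(X = x) for each distinct value x of X.
values = {X(w) for w in omega}
by_values = sum(x * sum(prob[w] for w in omega if X(w) == x) for x in values)

print(by_points, by_values)   # prints 1 1
```

The value X = 1 occurs at two sample points. In the first sum it appears twice with weight 1/4 each time; in the second it appears once with weight 1/2. Either way it contributes the same amount.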


10.3 The frequency interpretation for expectation


Here is a general statement of the key fact linking expected value to the real
world. We have already seen this in Example 10.1.


Probability Fact 10.1 (The frequency interpretation of expected
value). Let X be a random variable defined in terms of an experiment. If
the experiment is repeated many times, the theoretical expected value of the
corresponding mathematical random variable is likely to be approximately
equal to the average experimental value of the random variable X obtained from
the repeated experiments.


Justifying the frequency interpretation for expected values. We cannot
give a rigorous proof of a practical statement, but we will show that for
a random variable with finite range this rule is a direct consequence of the
frequency interpretation of probability. A similar argument was already used
in Example 10.1.

Let X be a random variable with finite range. Suppose that the range of
X consists of the distinct numbers x_1, ..., x_n.

By definition,


E[X] = x_1 P(X = x_1) + ... + x_n P(X = x_n).


Suppose that X represents a physical random variable in some experiment.
Consider a sequence of N repetitions of the experiment.

Let M_i be the number of those experiments for which the value of X is
equal to x_i. Then the average experimental value \bar{x} for X is given by


\bar{x} = \frac{1}{N} (sum of all measured values) = \frac{1}{N} \sum_{i=1}^{n} x_i M_i = \sum_{i=1}^{n} x_i \frac{M_i}{N}.


The frequency interpretation for probability says that for large N, it is
very likely that M_i/N ≈ P(X = x_i). Applying this approximation to every
term in the sum for \bar{x},


\bar{x} ≈ \sum_{i=1}^{n} x_i P(X = x_i) = E[X]. (10.14)


Equation (10.14) expresses Rule 10.1, so we have justified this rule.
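
The first step of this argument, rewriting the plain average as a sum weighted by the frequencies M_i/N, is an exact identity, and it can be seen on any list of measurements. The data in the sketch below are made up for illustration (ten payoffs of 0 or 2 dollars), and the variable names are assumptions.

```python
from collections import Counter
from fractions import Fraction

# Made-up measurements: the payoffs observed in N = 10 tosses.
measured = [2, 0, 2, 2, 0, 2, 0, 2, 2, 0]
N = len(measured)

# The plain average: (sum of all measured values) / N.
plain_average = Fraction(sum(measured), N)

# counts[x] is the number of experiments in which the value x occurred,
# so it plays the role of M_i for the value x_i = x.
counts = Counter(measured)
grouped_average = sum(x * Fraction(m, N) for x, m in counts.items())

print(plain_average, grouped_average)   # prints 6/5 6/5
```

Only the second step, replacing each frequency M_i/N by the probability P(X = x_i), is an approximation, and that is where the frequency interpretation of probability enters.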


A traditional name for this rule is “the Law of Large Numbers”. Some
corresponding mathematical properties of expected value are given in two
well-known theorems called “the Weak Law of Large Numbers” and “the
Strong Law of Large Numbers”. These are mathematical statements. The
law of large numbers as expressed in this rule is a practical statement,
not a mathematical theorem.


10.4 Additivity of expectation 


The frequency interpretation of expected value provides a strong connection 
between theoretical calculations and experimental results. We can use the 
physical interpretation of expected value to tell us what the mathematical 
properties must be. 

For example, consider two physical random variables X and Y which 
are defined for the same experiment. The measured value of X + Y is, by
definition, the sum of the value of X and the value of Y. By the frequency
interpretation, the average of the measured values of X + Y, over sufficiently 
many repeated experiments, is approximately E[X + Y]. And the average 
value for X +Y is equal to the sum of the average value for X and the average 
value for Y. This frequency argument leaves us in no doubt that additivity 
must hold for mathematical expected values: 


E[X + Y] = E[X] + E[Y]. (10.15)


Equation (10.15) is confirmed with the formal proof in Lemma 10.10, given
below. Additivity actually holds for expectations of all random variables (see
the statement of Theorem 14.9).


Lemma 10.10 (Additivity of expectation). Let X and Y be finite-range
random variables defined for the same probability model. Then


E[X + Y] = E[X] + E[Y]. (10.16)


Proof. Let x_1, ..., x_n be the distinct numbers in the range of X, and let
y_1, ..., y_m be the distinct numbers in the range of Y.
For i = 1, ..., n and j = 1, ..., m, let D_{ij} = {X = x_i and Y = y_j}. These
events are disjoint, and their union is Ω.
By Theorem 10.7,


E[X] = \sum_{i=1}^{n} \sum_{j=1}^{m} x_i P(D_{ij}), (10.17)


E[Y] = \sum_{i=1}^{n} \sum_{j=1}^{m} y_j P(D_{ij}), (10.18)


and


E[X + Y] = \sum_{i=1}^{n} \sum_{j=1}^{m} (x_i + y_j) P(D_{ij}). (10.19)


The right side of equation (10.19) is the sum of the right sides of equations
(10.17) and (10.18). Hence E[X + Y] = E[X] + E[Y].
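
The three double sums in this proof can be computed directly from a table of joint probabilities. The table below is hypothetical, chosen so that X and Y are visibly not independent; the point of the sketch is that additivity holds anyway. All names are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical joint probabilities P(X = x and Y = y), keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 2),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(3, 10),
}
assert sum(joint.values()) == 1

# Each sum has one term for every pair (x, y), as in the proof.
E_X = sum(x * p for (x, y), p in joint.items())
E_Y = sum(y * p for (x, y), p in joint.items())
E_sum = sum((x + y) * p for (x, y), p in joint.items())

print(E_X, E_Y, E_sum)   # prints 2/5 2/5 4/5
```

Here P(X = 1 and Y = 1) = 3/10, while P(X = 1) P(Y = 1) = 4/25, so the two random variables are dependent. Nothing in the computation of E_sum needed independence.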


Exercise 10.8. If you love the frequency interpretation, write out a careful 
derivation of the additivity property for physical random variables, using the 
frequency interpretation. 


Remark 10.11 (Using multiple indices). In the proof of Lemma 10.10,
does it seem strange to apply Theorem 10.7 to a situation in which the
disjoint sets are described by multiple indices i, j?

It is important to see that this is ok. Notice that the properties of the sets
D_1, ..., D_k in Theorem 10.7, such as disjointness and having union equal
to the whole space, do not depend on how the sets D_1, ..., D_k are labelled.

Also, the sum in the theorem would not be changed if we listed the
sets D_1, ..., D_k in a different way, provided that we included all of the sets
and did not list any of them more than once.

So the way we label our indices doesn’t matter. 


Here’s some useful terminology. 


Definition 10.12 (Linear operations). Consider any set of mathematical 
elements such that “addition” and “multiplication by a number” make sense. 
(Examples: the set of coordinate vectors in R^n, the set of functions on an
interval, the set of random variables on a sample space.)

An operation on such elements is said to be a linear operation if it
preserves addition and also preserves multiplication by a number. More
precisely, a linear operation is an operation with the following properties:

(i) The Additivity Property. The result of applying the operation to
a sum of elements is equal to the sum of the results of applying the
operation to each term separately.


(ii) The Scaling Property. The result of multiplying an element by a 
constant and then applying the operation is the same as the result of 
applying the operation first, and multiplying by the number afterwards. 


Linearity is a handy term, and we often use it. The rules of calculus
tell us that integration of functions is an example of a linear operation. By
the additivity and scaling lemmas above, the operation of taking expected
value is a linear operation. The next lemma records this fact for future reference.


Lemma 10.13 (Expectation is linear). Taking expected value is a linear 
operation, i.e. 


E[X + Y] = E[X] + E[Y]

and

E[cX] = cE[X].


10.5 Using linearity to find expectations 


We will use linearity so much that it will seem instinctive. Here are a few 
examples in which linearity plays a role in finding expectations. 


10.5.1 Expected number of successes for Bernoulli trials


As before, let S_n be the total number of successes in n Bernoulli
trials with success probability p. Then S_n has a binomial distribution. We
wish to find E[S_n].

Before we perform this calculation, let’s try to make it seem a little
impressive. Remember that n could be huge, a million or a trillion. If we really
care about the result, our method had better be right. Definitions and proofs
are what give us the confidence to produce numbers in situations where even
computers would be too slow.


Method 1: using additivity. Let A_i be the event that trial i gives success,
and let X_i = 1 on A_i and X_i = 0 on A_i^c. (Thus
X_i is the indicator function for the event A_i, as defined in Definition 17.1.)
We have already observed that X_i is the number of successes obtained in
trial i (either 0 or 1). From the definition of S_n, it follows that S_n = X_1 +
... + X_n. By additivity, E[S_n] = E[X_1] + ... + E[X_n]. From the definition
of expectation, E[X_i] = 1 · P(X_i = 1) + 0 · P(X_i = 0) = p. Hence


E[S_n] = np. (10.20)


Notice that E[S_n] is exactly what we would immediately compute from
the frequency interpretation, which is the basis of the common sense reasoning
used in Example 10.1.


Exercise 10.9 (Method 2 for expected number of heads). Method 2
is what you use when you don’t remember that expectation is additive. It is
perhaps unnecessary to add that this is not the right approach. Nevertheless,
we can learn from it.
By equation (9.3), P(S_n = k) = \binom{n}{k} p^k (1 − p)^{n−k}.
Calculate E[S_n] again, this time using Definition 10.2 and this formula.
As noted above, we can guess ahead of time that E[S_n] = np. So if
we see a factor of np in the algebra we should hang onto it in the calculation.
Finishing the calculation will verify our guess.
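
Before doing the algebra, you may like to see the guess confirmed for particular numbers. The sketch below adds up k · P(S_n = k) using the binomial formula and compares the result with np. It is a numerical check for one choice of n and p, not a substitute for the exercise; the function name and the chosen values are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_mean_by_definition(n, p):
    """Sum k * P(S_n = k) over k = 0, ..., n, using the binomial formula."""
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))


n, p = 10, Fraction(3, 5)
print(binomial_mean_by_definition(n, p), n * p)   # prints 6 6
```

Notice the difference in effort: the direct sum has n + 1 terms, while Method 1 needs a single multiplication no matter how large n is.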


10.5.2 Expected value of a hypergeometric random variable


Consider the experiment of Exercise 8.4. In that experiment, we have a bowl
which contains N marbles. A subset of n marbles is selected, with no marble
favored. There are K red marbles in the bowl, the others being green. Let
the random variable L_{N,K,n} be the number of red marbles selected. We wish
to find E[L_{N,K,n}].

L_{N,K,n} has a hypergeometric distribution with parameters N, K, n
(Definition 9.9), and our calculation for the expected value applies to any such
random variable.

Since linearity worked so well for coin-tossing, it’s the natural method to 
try here. And it works. But we need to set up the problem, and use all the 
information we have. The tricks we use here are worth noting! 


Method 1 for E[L_{N,K,n}]: using additivity. Assign each marble an
identification number, so that the marbles are numbered from 1 to N. For
convenience, let marbles 1, ..., K be the red ones.

Let X_ℓ = 1 if marble ℓ is selected, X_ℓ = 0 otherwise. Since no marble is
favored, E[X_ℓ] is the same for every ℓ.

Incidentally, the experiment was defined as choosing a subset of n marbles.
But it’s ok to focus on what happens to a particular marble. From the
definitions,

L_{N,K,n} = X_1 + ... + X_K. (10.21)


By linearity,
E[L_{N,K,n}] = E[X_1] + ... + E[X_K] = K E[X_1]. (10.22)


To find E[X_1] with minimal work, note that from the description of the
experiment there are always exactly n marbles selected. Hence


X_1 + ... + X_N = n, (10.23)


always.

Did we cheat here? When we write down this sum of N terms, did we
change the experiment? We’re only supposed to choose n marbles.

It’s ok, because we are not performing a new experiment; we are just
writing down a true mathematical fact about the random variables in the
model for the original experiment.

Mathematical expectation has been proven to be linear. So now you can
go ahead and take the expected value of equation (10.23). This gives


n = E[X_1] + ... + E[X_N] = N E[X_1].
Therefore E[X_1] = n/N, and hence by equation (10.22) we have


E[L_{N,K,n}] = \frac{nK}{N}, (10.24)

so we are done.


Remark 10.14 (Special cases). Fact 1. Incidentally, since we showed
that E[X_1] = n/N, and since X_1 is always either 1 or 0, the definition of
expected value tells us that E[X_1] = P(X_1 = 1). Thus our work has shown
that the probability of any particular marble being selected is exactly n/N.


Fact 2. Also, by the formula just derived, we know that E[L_{N,K,1}] = 1 · K/N = K/N.
Since L_{N,K,1} is always either 1 or 0, we know that the probability that a
single selected marble lies in the target set is K/N. This probability agrees
with the statement of Theorem 2.23.


Do you think that Fact 1 and Fact 2 are really the same statement? In
Fact 1, we have a particular element, and randomly choose n points. In
Fact 2, we have a particular set of K elements, and randomly choose one
point. But since the particular point could be any point, and the particular
set could be any set, it seems that in both cases we might as well say that
we have a random set and a random point, chosen independently, and we are
finding the probability that the random point is an element in the random
set.
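
The value nK/N can be checked numerically for particular marbles counts. The sketch below assumes the standard hypergeometric formula, P(L = i) = C(K, i) C(N − K, n − i) / C(N, n), which is taken here to be what the definition of the hypergeometric distribution states; the function name and the sample parameters are hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeometric_mean(N, K, n):
    """Sum i * P(L = i) over i = 0, ..., n for parameters N, K, n.

    math.comb(a, b) returns 0 when b > a, so indices i outside the range
    of L contribute nothing to the sum.
    """
    total = comb(N, n)   # number of ways to choose n marbles out of N
    return sum(
        i * Fraction(comb(K, i) * comb(N - K, n - i), total)
        for i in range(n + 1)
    )


print(hypergeometric_mean(10, 4, 3), Fraction(3 * 4, 10))   # prints 6/5 6/5
```

The linearity argument gave the same answer without any binomial coefficients at all.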


Method 2 for E[Ly.x,,|: direct calculation By “direct calculation” we 
mean something like the method of Exercise This is feasible, and a 
calculation is given next. You should definitely skip it as long as you agree 
that this is not the easy approach! 

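As in 9.9, the range of $L_{N,K,n}$ is the set of all $i$ that satisfy
equation (8.12). By equation (9.11), for each $i$ in the range of $L_{N,K,n}$ we have

\[ P(L_{N,K,n} = i) = \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}. \]

Hence by definition:

\[ E[L_{N,K,n}] = \sum_{i}^{*} i\, P(L_{N,K,n} = i) = \sum_{i}^{*} i\, \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}} = \sum_{i>0}^{*} i\, \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}, \]

where we write $\sum_{i}^{*}$ to mean a sum over the indices $i$ in the range of $L_{N,K,n}$,
and we write $\sum_{i>0}^{*}$ to mean a sum over the nonzero indices $i$ in the range of
$L_{N,K,n}$.

Using equation (8.7),

\[ E[L_{N,K,n}] = \sum_{i>0}^{*} \frac{K\binom{K-1}{i-1}\binom{N-K}{n-i}}{\frac{N}{n}\binom{N-1}{n-1}} = \frac{Kn}{N} \sum_{i>0}^{*} \frac{\binom{K-1}{i-1}\binom{N-K}{n-i}}{\binom{N-1}{n-1}}. \tag{10.25} \]

Since $N - K = (N-1) - (K-1)$ and $n - i = (n-1) - (i-1)$, this gives

\[ E[L_{N,K,n}] = \frac{Kn}{N} \sum_{i>0}^{*} \frac{\binom{K-1}{i-1}\binom{(N-1)-(K-1)}{(n-1)-(i-1)}}{\binom{N-1}{n-1}}. \tag{10.26} \]

By equation (8.12), the nonzero indices $i$ in the range of $L_{N,K,n}$ are those
$i$ such that

\[ 1 \le i, \quad K - i \le N - n, \quad i \le K, \quad \text{and} \quad i \le n. \tag{10.27} \]

Let $L_{N-1,K-1,n-1}$ denote a hypergeometric random variable with parameters
$N-1$, $K-1$, $n-1$.

Let $\ell = i - 1$. Then equation (10.27) says that

\[ 0 \le \ell, \quad (K-1) - \ell \le (N-1) - (n-1), \quad \ell \le K-1, \quad \text{and} \quad \ell \le n-1. \]

This is exactly the statement that $\ell$ is in the range of $L_{N-1,K-1,n-1}$. Hence

\[ \sum_{i>0}^{*} \frac{\binom{K-1}{i-1}\binom{(N-1)-(K-1)}{(n-1)-(i-1)}}{\binom{N-1}{n-1}} = \sum_{\ell}^{**} \frac{\binom{K-1}{\ell}\binom{(N-1)-(K-1)}{(n-1)-\ell}}{\binom{N-1}{n-1}}, \tag{10.28} \]

where we write $\sum_{\ell}^{**}$ to mean a sum over the indices $\ell$ in the range of $L_{N-1,K-1,n-1}$.

Using equation (9.11) with $N, K, n$ replaced by $N-1, K-1, n-1$, equation (10.28) says that

\[ \sum_{i>0}^{*} \frac{\binom{K-1}{i-1}\binom{(N-1)-(K-1)}{(n-1)-(i-1)}}{\binom{N-1}{n-1}} = \sum_{\ell}^{**} P(L_{N-1,K-1,n-1} = \ell). \]

This is the sum of the probabilities of the values of $L_{N-1,K-1,n-1}$ over all
possible values, so the sum is equal to one! By equation (10.26), $E[L_{N,K,n}] = Kn/N$.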
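As a sanity check on the direct calculation, the following sketch sums $i \cdot P(L_{N,K,n} = i)$ over the range using exact rational arithmetic and compares the result with $Kn/N$. The function name and the sample parameters are illustrative choices, not part of the text.

```python
from fractions import Fraction
from math import comb


def hypergeometric_mean_direct(N, K, n):
    """Sum i * P(L = i) over the range of a hypergeometric variable."""
    lowest = max(0, n - (N - K))   # cannot draw more non-target marbles than exist
    highest = min(n, K)            # cannot draw more target marbles than exist
    total = Fraction(0)
    for i in range(lowest, highest + 1):
        total += i * Fraction(comb(K, i) * comb(N - K, n - i), comb(N, n))
    return total


# Hypothetical parameters: 20 marbles, 7 in the target set, 5 drawn.
print(hypergeometric_mean_direct(20, 7, 5) == Fraction(7 * 5, 20))
```

Exact fractions are used so that the comparison with $Kn/N$ is an equality test rather than a floating-point approximation.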
10.5.3 Reflection symmetry


If x is a point on the real line we will say that the point -x is the mirror
image of x under reflection in the origin.

Consider a physical random variable X such that for any possible value 
of X, the negative of that value is just as likely to occur. Over many ex- 
periments, the positive and negative values of this random variable will tend 
to cancel out. In the long run, the average should be close to zero. By the
Frequency Interpretation of Expected Value, it must be true that E[X] = 0.

The next exercise asks you to give a more precise argument to show that
E[X] = 0.


Exercise 10.10. Let X be a random variable with finite range. Let $a_1, \ldots, a_k$
be distinct positive numbers, and suppose that the nonzero range of X is the
set of numbers $a_1, -a_1, a_2, -a_2, \ldots, a_k, -a_k$. In addition, suppose that for
each $i = 1, \ldots, k$,

\[ P(X = -a_i) = P(X = a_i). \tag{10.29} \]


Use Definition to show that E[X] = 0. As usual, Exercise is 
convenient in applying the definition of expected value. 


Exercise 10.10 uses mathematical reasoning which is close to the physical
picture. But general mathematical arguments can be more powerful, as in 
the next lemma. 


Lemma 10.15 (Reflection symmetry gives mean zero). Let X be a
random variable such that X and -X have the same distribution.
If E[X] exists, then E[X] = 0.


Proof. The expected value of any random variable is determined by its
distribution, and for this particular random variable X it is assumed that X
and -X have the same distribution. Therefore E[X] = E[-X].

Using linearity of expectation, E[-X] = -E[X]. Hence E[X] = -E[X], so
2E[X] = 0 and therefore E[X] = 0.

Does Lemma 10.15 give Exercise as a special case? Sure! Saying
that $P(X = -a_i) = P(X = a_i)$ is the same as saying that $P(-X = a_i) = P(X = a_i)$,
and so the assumption of Exercise is equivalent to the
assumption that X and -X have the same distribution.
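The lemma can be tried out on a small finite distribution. The sketch below stores a distribution as a dictionary from values to probabilities (a representation chosen here for illustration, not taken from the text), checks that X and -X have the same distribution, and then computes E[X].

```python
from fractions import Fraction


def expectation(dist):
    """E[X] for a finite distribution given as {value: probability}."""
    return sum(value * prob for value, prob in dist.items())


def negated(dist):
    """Distribution of -X."""
    return {-value: prob for value, prob in dist.items()}


# Hypothetical example: symmetric under reflection, but not uniform.
dist = {-5: Fraction(1, 8), -2: Fraction(3, 8), 2: Fraction(3, 8), 5: Fraction(1, 8)}

print(negated(dist) == dist)     # X and -X have the same distribution
print(expectation(dist) == 0)    # so the mean is zero
```

Note that the example is deliberately not uniform: the lemma only needs the symmetry of the distribution.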


Exercise 10.11. Let X be a random variable whose range is exactly the set
S of integers i with $-1000000 \le i \le 1000000$. Assume that the distribution
of X is uniform on S. Find E[X] in two ways:


(i) By calculation using the definition of expected value, and
(ii) using Lemma 10.15.


Exercise 10.12. Let X be a random variable whose range is exactly the set
S of integers i with $0 \le i \le 1000$. Assume that the distribution of X is
uniform on S.


(i) Find the distribution of X — 500. 
(ii) Find E [X]. 


10.6 Monotonicity of expectations 


Exact values are often not available, so we need to be able to deal with 
estimates and inequalities. 


Lemma 10.16 (Monotonicity of expectations). Let X and Y be random
variables with finite range, such that $X(\omega) \le Y(\omega)$ for all sample points $\omega$.
Then

\[ E[X] \le E[Y]. \tag{10.30} \]

Proof. By assumption, Y - X is a nonnegative random variable, i.e. all
values are nonnegative.

The definition of expected value in the finite range case shows that
$E[Y - X] \ge 0$.

Since Y = X + (Y - X), additivity tells us that


E[Y] = E[X] + E[Y - X].


Since $E[Y - X] \ge 0$, we are done.


Exercise 10.13. Give a derivation of the monotonicity property for physical 
random variables, using the frequency interpretation. 


Example 10.17 (E[X] and E[|X|]). Let X be a finite-range random variable,
and let $x_1, \ldots, x_k$ be the distinct numbers in the range of X. Using the
definition of expected value and the triangle inequality (Appendix B),

\[ |E[X]| = \Big| \sum_{i=1}^{k} x_i P(X = x_i) \Big| \le \sum_{i=1}^{k} |x_i P(X = x_i)| = \sum_{i=1}^{k} |x_i| P(X = x_i). \]

The numbers $|x_1|, \ldots, |x_k|$ may not be distinct, if X happens to have both
positive and negative values. But using Theorem 10.12 we see that

\[ \sum_{i=1}^{k} |x_i| P(X = x_i) = E[|X|]. \]

So we have proved an interesting inequality:

\[ |E[X]| \le E[|X|]. \tag{10.31} \]

Of course, our proof used the finite-range property for X. But the inequality
is true for general expected values. Furthermore, there is actually a slick way
to derive it, just using general properties: linearity and monotonicity.

Note that $X \le |X|$. Hence, by monotonicity, $E[X] \le E[|X|]$.

But we also have $-X \le |-X| = |X|$. So, by monotonicity, $E[-X] \le E[|X|]$.
Then, by linearity, $-E[X] \le E[|X|]$.

One of the numbers E[X], -E[X] must be equal to |E[X]|. And each
of these numbers is less than or equal to E[|X|]. So we have shown that
equation (10.31) holds in general.
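For a concrete feel of inequality (10.31), here is a small numerical sketch. The distribution is a made-up example, and the helper `expectation` (a name introduced here) computes E[g(X)] as the sum of g(x) P(X = x) over the range, which is what Theorem 10.12 is used for above.

```python
from fractions import Fraction


def expectation(dist, g=lambda v: v):
    """E[g(X)] computed as the sum of g(x) * P(X = x) over the range."""
    return sum(g(value) * prob for value, prob in dist.items())


# Hypothetical distribution with both positive and negative values.
dist = {-3: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}

mean = expectation(dist)            # E[X]
mean_abs = expectation(dist, abs)   # E[|X|]
print(mean, mean_abs, abs(mean) <= mean_abs)
```

The gap between the two sides comes from cancellation: positive and negative values offset each other in E[X] but not in E[|X|].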


10.7 General random variables 


We won't take time to define expected value for general mathematical random 
variables in this book, but we should remember that expectation is not just 
something for random variables that have finite range. Expectation can be 
defined for any bounded random variable. A bounded random variable is a 
random variable which is a bounded function on the sample space. 

You could probably guess the definition of a bounded function, but we'll 
state it carefully anyway. 


Definition 10.18 (Bounded functions). A function f on any set is said 
to be bounded if there is some number c such that |f(x)| < c holds for every 
x in the domain of f. 


For unbounded random variables, sometimes the expected value exists, 
and sometimes it doesn’t. 


Remark 10.19 (Linearity for expectation of general random variables).
The linearity property holds for bounded random variables, as we
would hope. For unbounded random variables, the linearity property comes
with a little bit of “fine print”, since the expected values might not exist. So
the correct way to state Lemma 10.10 in the general case will be to say that
if E[X] and E[Y] exist, then E[X + Y] exists and equation (10.16) holds
(see Theorem 14.9). Similarly, the general version of Lemma says that
if E[X] exists then E[cX] exists and equation holds.
That seems easy to remember.


In this book we'll usually stick to finite-range random variables when we 
want to give a careful derivation of some fact. But much of what is true for 
finite-range random variables is true in general.

Example 10.20 (Expectations with a density on the sample space).
Just so readers can see an example of calculating expected values using a
different method from that given in equation (10.1), suppose that the sample
space $\Omega$ is a continuous interval $[s,t]$ of the real line, as discussed in
Chapter 3. Assume that probabilities are given by a uniform distribution on $[s,t]$
(Definition 3.2).

As in Exercise 3.5, the uniform distribution on $[s,t]$ is given by a constant
density, say $f = c$. And since $P(\Omega) = 1$ must hold, we need to have $\int_s^t f = 1$,
and so $c = 1/(t-s)$.

In that situation, if X is a random variable on the sample space, it turns
out that the correct formula for E[X] is:

\[ E[X] = \int_s^t X(u)\, c\, du. \tag{10.32} \]

In other words, here we find E[X] by integrating its value over the sample
space.

More generally, if the distribution on $\Omega = [s,t]$ is given by a density
function $f$ (as in Definition 3.4), the correct formula for E[X] is:

\[ E[X] = \int_s^t X(u) f(u)\, du. \tag{10.33} \]

Not surprising, just different from the finite-range case.
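Formula (10.33) can be approximated numerically. The sketch below uses a plain midpoint rule; the function name, the interval, and the choice $X(u) = u^2$ are assumptions made for illustration only.

```python
def expectation_with_density(X, f, s, t, steps=10_000):
    """Approximate the integral of X(u) * f(u) over [s, t] by the midpoint rule."""
    width = (t - s) / steps
    total = 0.0
    for j in range(steps):
        u = s + (j + 0.5) * width      # midpoint of the j-th subinterval
        total += X(u) * f(u) * width
    return total


s, t = 0.0, 3.0
c = 1.0 / (t - s)   # constant density of the uniform distribution on [s, t]

# Hypothetical random variable X(u) = u**2.
# Exact value: (t**3 - s**3) / (3 * (t - s)) = 3 for this interval.
approx = expectation_with_density(lambda u: u * u, lambda u: c, s, t)
print(abs(approx - 3.0) < 1e-6)
```

Passing a non-constant `f` gives the general case (10.33); the uniform case (10.32) is just `f` equal to the constant `c`.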


10.8 Solutions for Chapter 10


Solution (Exercise 10.1). The range of Y is {0,2}, while P(Y = 0) = 2/5
and P(Y = 2) = 3/5.
By definition,

\[ E[Y] = \frac{2}{5} \cdot 0 + \frac{3}{5} \cdot 2 = 1.2. \]

Solution (Exercise 10.2). Each value v in the range should contribute to
the expected value E[X]. The contribution is given by the term v P(X = v)
in the sum which defines E[X].

Physically, the importance of a value v for the expected value should
depend on the probability that X is equal to v. If a value appears twice in
the sum, then its contribution to the sum is doubled. This is not consistent
with the actual importance of the value.


Solution (Exercise 10.3). Definition shows that E[X] is given by a
weighted sum of terms.
Addition is commutative! (Yes, I’ve been waiting for a chance to say
that.) Changing the order of the terms in a sum does not change the value
of the sum.


Solution (Exercise 10.4). Let $x_1, \ldots, x_k$ be a list of the distinct elements
in the range of X. By definition,

\[ E[X] = \sum_{i=1}^{k} x_i P(X = x_i). \]

To show that equation (10.5) holds, we need to compare two sums, and see
if they are equal:

\[ \sum_{i=1}^{k} x_i P(X = x_i) \quad \text{and} \quad \sum_{j=1}^{n} y_j P(X = y_j). \]

The order of the terms in a sum does not matter.

Suppose that a value $y_j$ is not in the range. Then $P(X = y_j) = 0$, and
the term $y_j P(X = y_j) = 0$, so that term contributes nothing to the sum on
the right. We can throw away any term like that from the sum on the right.

Suppose that 0 is in the range. Then $0 = x_i$ for some $i$. The term
$x_i P(X = x_i) = 0$, so that term contributes nothing to the sum on the left,
and we can throw it away from the sum on the left. For the same reason, if
$y_j = 0$ for some $j$, we can throw away the term $y_j P(X = y_j)$.

After all this throwing away, the remaining sum on the left will have the
same terms as the remaining sum on the right, possibly in a different order.
So the sums are indeed equal.


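Solution (Exercise 10.5). Apply Exercise with n = 1 and $y_1 =$ .

Solution (Exercise 10.6). (i) From the assumptions, the range of X is
{0,1,2,5}, and

\[
\begin{aligned}
\{X = 0\} &= \{Z = 0\}, \\
\{X = 1\} &= \{Z = 1\} \cup \{Z = 3\} \cup \ldots \cup \{Z = 99\}, \\
\{X = 2\} &= \{Z = 2\} \cup \{Z = 4\} \cup \ldots \cup \{Z = 98\}, \\
\{X = 5\} &= \{Z = 100\}.
\end{aligned}
\tag{10.34}
\]

Hence the distribution of X is given by:

\[
\begin{aligned}
P(X = 0) &= P(Z = 0) = \frac{1}{101}, \\
P(X = 1) &= P(Z = 1) + P(Z = 3) + \ldots + P(Z = 99) = \frac{50}{101}, \\
P(X = 2) &= P(Z = 2) + P(Z = 4) + \ldots + P(Z = 98) = \frac{49}{101}, \\
P(X = 5) &= P(Z = 100) = \frac{1}{101}.
\end{aligned}
\tag{10.35}
\]

By definition,

\[
\begin{aligned}
E[X] &= 0 \cdot P(X = 0) + 1 \cdot P(X = 1) + 2 \cdot P(X = 2) + 5 \cdot P(X = 5) \\
&= 0 \cdot \frac{1}{101} + 1 \cdot \frac{50}{101} + 2 \cdot \frac{49}{101} + 5 \cdot \frac{1}{101} = \frac{153}{101}.
\end{aligned}
\tag{10.36}
\]

(ii) By equation (10.35),

\[
\begin{aligned}
0 \cdot P(X = 0) &= 0 \cdot P(Z = 0), \\
1 \cdot P(X = 1) &= 1 \cdot P(Z = 1) + 1 \cdot P(Z = 3) + \ldots + 1 \cdot P(Z = 99), \\
2 \cdot P(X = 2) &= 2 \cdot P(Z = 2) + 2 \cdot P(Z = 4) + \ldots + 2 \cdot P(Z = 98), \\
5 \cdot P(X = 5) &= 5 \cdot P(Z = 100).
\end{aligned}
\tag{10.37}
\]

Since $\varphi$ gives the value of X in terms of the value of Z, we can rewrite
equation (10.37) as:

\[
\begin{aligned}
0 \cdot P(X = 0) &= \varphi(0) \cdot P(Z = 0), \\
1 \cdot P(X = 1) &= \varphi(1) \cdot P(Z = 1) + \varphi(3) \cdot P(Z = 3) + \ldots + \varphi(99) \cdot P(Z = 99), \\
2 \cdot P(X = 2) &= \varphi(2) \cdot P(Z = 2) + \varphi(4) \cdot P(Z = 4) + \ldots + \varphi(98) \cdot P(Z = 98), \\
5 \cdot P(X = 5) &= \varphi(100) \cdot P(Z = 100).
\end{aligned}
\tag{10.38}
\]

If you add up all the equations in statement (10.38), you obtain:

\[ E[X] = \sum_{j=0}^{100} \varphi(j) P(Z = j), \]

which is equation (10.8).


To phrase this differently: the proof of equation (10.8) is just a matter of
grouping the terms in the sum, in order to obtain equation (10.36).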

Solution (Exercise |10.7). 


From the definition of expected value 


2 
E [X?] =O = Ohl ROS ad) pd BC Se) ede 


Using Theorem [10.8] 


, ae al el jee ae nee ae 
E [X?] = (2) fam 1) ee et "t2 ne ee 
Solution (Exercise 10.8). Consider a long sequence of N repeated
experiments. Let the measured values of X in these experiments be $x_1, \ldots, x_N$
and let the measured values of Y in these experiments be $y_1, \ldots, y_N$. Then
the measured results for X + Y are $x_1 + y_1, \ldots, x_N + y_N$. The corresponding
experimental averages are:

\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i, \qquad \overline{x+y} = \frac{1}{N} \sum_{i=1}^{N} (x_i + y_i). \]

Of course

\[ \sum_{i=1}^{N} (x_i + y_i) = \sum_{i=1}^{N} x_i + \sum_{i=1}^{N} y_i, \]

so

\[ \overline{x+y} = \bar{x} + \bar{y}. \tag{10.40} \]

The frequency interpretation for expected value tells us that for large N,
$\bar{x} \approx E[X]$, $\bar{y} \approx E[Y]$ and $\overline{x+y} \approx E[X+Y]$. Since these approximations
can be made as precise as we like by taking a large number of repetitions,
equation (10.40) implies E[X + Y] = E[X] + E[Y].

Solution (Exercise 10.13). Consider a long sequence of N repeated
experiments. Let the measured values of X in these experiments be $x_1, \ldots, x_N$
and let the measured values of Y in these experiments be $y_1, \ldots, y_N$.

By assumption, $x_i \le y_i$ for every $i$.
The corresponding experimental averages are:

\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i. \tag{10.41} \]

Since

\[ \sum_{i=1}^{N} x_i \le \sum_{i=1}^{N} y_i, \]

we have $\bar{x} \le \bar{y}$.
Taking N larger and larger gives averages $\bar{x}$ and $\bar{y}$ which approximate
E[X] and E[Y] as precisely as we like. Hence we must have $E[X] \le E[Y]$.


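Solution (Exercise 10.9). By definition

\[ E[S_n] = \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k}. \]

The k = 0 term is zero, so

\[ E[S_n] = \sum_{k=1}^{n} k \binom{n}{k} p^k (1-p)^{n-k}. \]

By equation (8.7),

\[ E[S_n] = \sum_{k=1}^{n} n \binom{n-1}{k-1} p^k (1-p)^{n-k} = np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{(n-1)-(k-1)}. \]

Letting $i = k - 1$,

\[ E[S_n] = np \sum_{i=0}^{n-1} \binom{n-1}{i} p^{i} (1-p)^{(n-1)-i}. \tag{10.42} \]

Let $S_{n-1}$ denote the number of heads obtained in $n - 1$ coin tosses, when the
coin has success probability $p$. Equation (10.42) says that

\[ E[S_n] = np \sum_{i=0}^{n-1} P(S_{n-1} = i). \]

Since the range of $S_{n-1}$ is $\{0, 1, \ldots, n-1\}$,

\[ \sum_{i=0}^{n-1} P(S_{n-1} = i) = 1. \]

Hence $E[S_n] = np$.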
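The conclusion $E[S_n] = np$ can be checked against the defining sum. The sketch below evaluates the sum exactly with rational arithmetic; the function name and the sample values of $n$ and $p$ are hypothetical choices.

```python
from fractions import Fraction
from math import comb


def binomial_mean_direct(n, p):
    """Sum k * P(S_n = k) for k = 0, ..., n, straight from the definition."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))


p = Fraction(1, 5)   # hypothetical success probability
n = 6                # hypothetical number of tosses
print(binomial_mean_direct(n, p) == n * p)
```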


Solution (Exercise 10.10). By Exercise 10.4,

\[ E[X] = a_1 P(X = a_1) + \ldots + a_k P(X = a_k) - a_1 P(X = -a_1) - \ldots - a_k P(X = -a_k) = 0. \]

Solution (Exercise 10.11).
Method 1


range of X = {-1000000, -999999, ..., 999999, 1000000},


and all these points have equal probability. There are 2000001 points in the
range. Hence, by definition,

\[ E[X] = \sum_{i=-1000000}^{1000000} i \cdot \frac{1}{2000001} = \Big( \sum_{i=-1000000}^{-1} i \cdot \frac{1}{2000001} \Big) + 0 \cdot \frac{1}{2000001} + \sum_{i=1}^{1000000} i \cdot \frac{1}{2000001}. \]

Thus

\[ E[X] = \Big( \sum_{i=-1000000}^{-1} i \cdot \frac{1}{2000001} \Big) + \sum_{i=1}^{1000000} i \cdot \frac{1}{2000001}. \]

Let $j = -i$ in the first of these two sums. Then that sum becomes

\[ - \sum_{j=1}^{1000000} j \cdot \frac{1}{2000001}, \]

and this term cancels with the second sum, so E[X] = 0.
Method 2 Again we note that
range of X = {-1000000, -999999, ..., 999999, 1000000}.


All points in the range have the same probability, and if $j$ is in the range
then so is $-j$.

Since $P(-X = j) = P(X = -j) = P(X = j)$, it follows that X and -X
have the same distribution. Hence by Lemma 10.15, E[X] = 0.

Solution (Exercise 10.12). We see that
range of X = {0, 1, ..., 1000}.


All these values have the same probability, so

\[ P(X = i) = \frac{1}{1001} \quad \text{for each } i \text{ in the range of } X. \]

Let Y = X - 500.
range of Y = {0 - 500, 1 - 500, ..., 1000 - 500} = {-500, -499, ..., 499, 500}.


Hence


Fact 1 Y and -Y have the same range.


We also see that for each $i$ in the range of Y,

\[ P(Y = i) = P(X - 500 = i) = P(X = i + 500) = \frac{1}{1001}. \]

Hence


Fact 2 All the points in the range of Y have the same probability.


Facts 1 and 2 imply that $P(Y = i) = P(Y = -i) = P(-Y = i)$ for all $i$
in the range of Y. Thus the distributions of Y and -Y are the same.

By Lemma 10.15, E[Y] = 0.

That is, E[X - 500] = 0.

Now we use linearity again. Since expectation is a linear operation,


E[X] - E[500] = 0, i.e. E[X] = 500.

More properties of expected value


11.1 Indicator Functions 


In this section we introduce a simple notation which is useful when writing 
expressions involving expectations or integrals. 


Definition 11.1 (Indicator function of a set). Let a set S be given, and
let A be any subset of S. We define the indicator function of A, denoted
by $1_A$, as follows.

$1_A$ is a function on S, and for any $x \in S$:

\[ 1_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise.} \end{cases} \tag{11.1} \]

It should be emphasized that indicator functions are a general idea,
defined for subsets of any set, not just sample spaces or subsets of the real line.
You can picture $1_A(x)$ as a signal light which comes on when x is a member
of A.

Please check that from the definition,

\[ 1_A = 1_B \iff A = B. \tag{11.2} \]

Here we use $\iff$ to mean “if and only if” (i.e. “implies” in both directions).

Lemma can be expressed using indicator functions. It says that:

\[ E[c\, 1_A] = c\, P(A). \tag{11.3} \]

In particular we have the fundamental equation connecting probability and
expected value:

\[ E[1_A] = P(A). \tag{11.4} \]


It seems to be easier to manipulate numbers than sets, so it can be prof- 
itable to translate set statements into indicator statements. 


Exercise 11.1 (Basic indicator facts). Prove all the following facts:

\[
\begin{aligned}
1_A^2 &= 1_A, && (11.5) \\
1_{A^c} &= 1 - 1_A, && (11.6) \\
1_{A \cap B} &= \min(1_A, 1_B) = 1_A 1_B, && (11.7) \\
1_{A \cup B} &= \max(1_A, 1_B). && (11.8)
\end{aligned}
\]

As a rather trivial example, note that using equation (11.6) twice we have

\[ 1_{(A^c)^c} = 1 - 1_{A^c} = 1 - (1 - 1_A) = 1_A. \]

This gives another derivation of equation (2.24), which says that $(A^c)^c = A$.
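Equations (11.5) through (11.8) can be checked pointwise on any small finite set before attempting the proofs. In the sketch below, indicator functions are modeled as Python functions returning 0 or 1; the helper names and the sets are illustrative assumptions.

```python
def indicator(subset):
    """Return the indicator function of `subset` as a Python function."""
    return lambda x: 1 if x in subset else 0


def check_basic_facts(S, A, B):
    """Check equations (11.5)-(11.8) at every point of the finite set S."""
    one_A, one_B = indicator(A), indicator(B)
    for x in S:
        a, b = one_A(x), one_B(x)
        assert a * a == a                                  # (11.5)
        assert indicator(S - A)(x) == 1 - a                # (11.6)
        assert indicator(A & B)(x) == min(a, b) == a * b   # (11.7)
        assert indicator(A | B)(x) == max(a, b)            # (11.8)
    return True


# Hypothetical finite example.
S = set(range(10))
print(check_basic_facts(S, {1, 2, 3, 4}, {3, 4, 5, 6}))
```

A check like this is not a proof, but it shows the pattern the proofs follow: evaluate both sides at each point and compare.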


Exercise 11.2 (Indicator of a disjoint union). Suppose that A, B are
any subsets of a given set S. Show that

\[ A \text{ and } B \text{ are disjoint} \iff 1_{A \cup B} = 1_A + 1_B. \tag{11.9} \]

Here’s a useful fact about numbers.


Exercise 11.3 (Sum equals max plus min). Show that for any real
numbers t, u,

\[ t + u = \max(t, u) + \min(t, u). \tag{11.10} \]

As a consequence of Exercise 11.3 and equations (11.7) and (11.8), we
have

\[ 1_{A \cup B} + 1_{A \cap B} = 1_A + 1_B. \tag{11.11} \]

Rewriting equation (11.11) as

\[ 1_{A \cup B} = 1_A + 1_B - 1_{A \cap B}, \tag{11.12} \]

taking expectations of both sides, and then applying equation (11.4), we
obtain Theorem 2.25, the Inclusion-Exclusion formula.

You may remember that in the original proof for inclusion-exclusion, we
used the trick of breaking up events into disjoint pieces. That seemed useful,
but we don’t seem to be using that trick with this approach. Or are we?
Maybe the pieces are the one-point sets in the sample space.


Since $1_{A \cap B}$ is the zero function if and only if $A \cap B$ is the empty set,
equation (11.11) gives us equation (11.9) as a special case.
Note that using equation (11.7) we can rewrite equation (11.12) as

\[ 1_{A \cup B} = 1_A + 1_B - 1_A 1_B. \tag{11.13} \]
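The step from the indicator identity to Inclusion-Exclusion can be traced on a small model. The sketch below assumes a fair die as the sample space (a made-up example) and takes expectations of both sides of the identity $1_{A \cup B} = 1_A + 1_B - 1_A 1_B$; the helper names are introduced here for illustration.

```python
from fractions import Fraction

# Hypothetical model: one roll of a fair die.
omega = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in omega}


def expect(rv):
    """E[rv] for a random variable given as a function on the sample space."""
    return sum(rv(w) * prob[w] for w in omega)


def indicator(event):
    return lambda w: 1 if w in event else 0


A = {2, 4, 6}   # even roll
B = {4, 5, 6}   # roll of at least four
one_A, one_B = indicator(A), indicator(B)

# Left side: E of the indicator of the union, which equals P(A union B).
lhs = expect(indicator(A | B))
# Right side: E of 1_A + 1_B - 1_A * 1_B, i.e. P(A) + P(B) - P(A and B).
rhs = expect(lambda w: one_A(w) + one_B(w) - one_A(w) * one_B(w))
print(lhs, rhs)
```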


Exercise 11.4. One can generalize Theorem 2.25 to the case of n sets. The
usual proof using set operations has two steps. In the first step, one guesses
the correct formula in some way. In the second step, one proves the
conjectured formula by induction.

Instead of using that approach, derive the correct formula for the case
n = 3, and prove it at the same time, by applying equation (11.13) twice.

Remark 11.2 (Subadditivity for indicators). We used the
Inclusion-Exclusion formula to prove subadditivity, Theorem 2.26. But now that we
have indicator functions, it seems more direct just to note an obvious
subadditivity fact for indicator functions: for any events $D_1, \ldots, D_k$, if D is the
union of these events, then

\[ 1_D \le \sum_{j=1}^{k} 1_{D_j}. \tag{11.14} \]

To prove equation (11.14), just evaluate both sides for a sample point $\omega$, as
follows.

If $\omega \in D$ then $\omega \in D_j$ for at least one $j$. Thus the left side of the equation
is one, and the right side is greater than or equal to one.

And if $\omega$ is not in D, then the left side of the equation is zero, and the
right side cannot be negative.

So equation (11.14) holds. Take expectation of both sides of the
inequality in this equation. Using monotonicity and linearity of expectation, and
equation (11.4), you will produce Theorem 2.26.


The next exercise is important for understanding how our concepts fit 
together. 


Exercise 11.5 (Random variable as a sum of constants times
indicators). Let X be a random variable with finite range. Let $x_1, \ldots, x_k$ be the
distinct values in the range of X.


(i) Explain why

\[ X = x_1 1_{\{X = x_1\}} + \ldots + x_k 1_{\{X = x_k\}}. \tag{11.15} \]

(ii) Show that linearity of expectation and equation (11.4) imply
equation (10.1), which is the defining formula for E[X].

Thus for finite range random variables, linearity of expectation and
equation (11.4) imply everything about expected values.
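A small experiment may help before writing out the argument. The sketch below assumes a model of two fair coin tosses with X the number of heads (a hypothetical example), checks the decomposition of part (i) at every sample point, and compares the two ways of computing E[X] from part (ii).

```python
from fractions import Fraction

# Hypothetical model: two fair coin tosses, X = number of heads.
omega = ["HH", "HT", "TH", "TT"]
prob = {w: Fraction(1, 4) for w in omega}


def X(w):
    return w.count("H")


values = sorted({X(w) for w in omega})   # distinct values x_1, ..., x_k


def indicator_of_level(x):
    """Indicator of the event {X = x}."""
    return lambda w: 1 if X(w) == x else 0


# Part (i): X(w) equals the sum of x_i * 1_{X = x_i}(w) at every sample point.
for w in omega:
    assert X(w) == sum(x * indicator_of_level(x)(w) for x in values)

# Part (ii): the sum of x_i * P(X = x_i) agrees with the sum over sample points.
by_sample_points = sum(X(w) * prob[w] for w in omega)
by_values = sum(x * sum(prob[w] for w in omega if X(w) == x) for x in values)
print(by_sample_points == by_values)
```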
Remark 11.3 (Integral over a set using indicator notation). In
calculus we are very familiar with the idea of integrating a function over a set,
usually when the set is an interval.

In equation (3.16) we gave the general definition for integrating a function
over a set. The function g in equation (3.16) is easily seen to be equal to
$1_A f$, so $\int_A f$ can be conveniently expressed using indicator functions:

\[ \int_A f = \int 1_A f. \tag{11.16} \]


We'll sometimes use this notation later, for convenience. 


Example 11.4 (Additivity for integration). Back in Section 3.5 we
mentioned that for disjoint sets, the integral over the union is the sum of the
integrals over the disjoint sets making up the union (equation (3.14)): i.e. if
$A = D_1 \cup D_2$, where $D_1, D_2$ are disjoint, then

\[ \int_A f = \int_{D_1} f + \int_{D_2} f. \tag{11.17} \]

This follows from the definition of integration over a set. We can express
the argument very neatly by using indicator function notation and equation
(11.16). Equation (11.9) tells us that

\[ 1_A = 1_{D_1} + 1_{D_2}. \]

Since integration is an additive operation, multiplying this equation by $f$
and integrating gives equation (11.17).


Exercise 11.6 (Writing a random variable using cases). Let $D_1, \ldots, D_n$
be events for some probability model. Suppose that:


(a) The events $D_1, \ldots, D_n$ are disjoint, and


(b) The union of $D_1, \ldots, D_n$ is the whole sample space $\Omega$.

Let Z be a random variable such that Z is constant on each set $D_i$ (as in
Figure 10.1). For each $i$, let $v_i$ be a number such that $Z(\omega) = v_i$ for every
$\omega \in D_i$. (Thus if $D_i$ is empty, $v_i$ can be any number.)

Under these assumptions, prove that

\[ Z = v_1 1_{D_1} + \ldots + v_n 1_{D_n}. \tag{11.18} \]


11.2 Expectation over a set 


We defined integration over a set in equation (3.16). In this section we define 
a similar concept for expected value. The idea is simple but convenient. 


Definition 11.5 (Expectation over a subset of the sample space).
Let a probability model be given with sample space $\Omega$. For any real-valued
random variable X and any event A, define the expectation of X over A by

\[ \text{expectation of } X \text{ over } A = E[Z], \tag{11.19} \]

where

\[ Z(\omega) = \begin{cases} X(\omega) & \text{if } \omega \in A, \\ 0 & \text{otherwise.} \end{cases} \tag{11.20} \]

Since this definition is intended to apply to general random variables, we
have to mention that equation (11.19) is the definition of the expectation of
X over A, if E[Z] exists. If E[Z] does not exist, then the expectation of X
over A is undefined. Of course if the range of X is finite, the range of Z is
finite, so E[Z] certainly exists, and there is no problem.

Indicator function notation (Definition 11.1) gives us a handy way to
write expectation over a set:

\[ \text{expectation of } X \text{ over } A = E[1_A X]. \tag{11.21} \]

This definition applies to any random variable, although in the present
chapter we will only study its properties in the case of random variables with
finite range.

Expectation over a set is especially useful when dealing with combining 
expectations obtained under differing assumptions. The key concept for that 
purpose is called conditional expectation and it deserves its own section. 


11.3 Conditional expectation


The following definition holds for general random variables, not just random 
variables with finite range. 


Definition 11.6 (Conditional expectation). Let X be a random variable
on a sample space $\Omega$, and let A be an event with P(A) > 0. The conditional
expectation of X given A, denoted by $E[X \mid A]$, is defined by

\[ E[X \mid A] = \frac{E[1_A X]}{P(A)}. \tag{11.22} \]

Equation (11.22) is a convenient mathematical formula for conditional
expectation, but the physical meaning of conditional expectation is better
expressed in the following lemma. Like Definition 11.6, this lemma holds for
all random variables, not just random variables with finite range.


Lemma 11.7 (Conditional expectation uses conditional probabilities).
Define the conditional probability set-function $\tilde{P}$ by

\[ \tilde{P}(D) = P(D \mid A) \tag{11.23} \]

for any event D. The definition of $\tilde{P}$ says that it is the probability distribution
which incorporates additional knowledge, namely that event A has occurred.
Then: the conditional expectation of X given A, which was defined in
equation (11.22), is equal to the expected value of X using $\tilde{P}$ instead of P.

When we are using $\tilde{P}$ as our probability set-function we can denote the
expectation of X by $E_{\tilde{P}}[X]$. Thus Lemma 11.7 can be stated compactly as:

\[ E[X \mid A] = E_{\tilde{P}}[X]. \tag{11.24} \]

And this equation expresses the fact that conditional expectation really does
mean what its name suggests.


Proof. For the proof we assume that X has a finite range.
Let $x_1, \ldots, x_k$ be a list of the distinct values in the range of X.
By Exercise

\[ X = \sum_{i=1}^{k} x_i 1_{\{X = x_i\}}. \]

Let A be an event with P(A) > 0. Then

\[ 1_A X = \sum_{i=1}^{k} x_i 1_A 1_{\{X = x_i\}}. \]

Using equation (11.7), this says that

\[ 1_A X = \sum_{i=1}^{k} x_i 1_{A \cap \{X = x_i\}}. \]

Taking expected value of both sides of the equation, and using equation
(11.4),

\[ E[1_A X] = \sum_{i=1}^{k} x_i P(A \cap \{X = x_i\}), \]

so

\[ \frac{E[1_A X]}{P(A)} = \sum_{i=1}^{k} x_i \frac{P(A \cap \{X = x_i\})}{P(A)} = \sum_{i=1}^{k} x_i \tilde{P}(X = x_i). \]

By the mathematical definition, the left side of this equation is $E[X \mid A]$.

The right side of the equation is equal to $E_{\tilde{P}}[X]$ by the definition of
expected value.

This proves equation (11.24).

Our proof justified Lemma 11.7 for the special case of a random variable
X with finite range, but remember that this lemma holds for all random
variables X.
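To see the definition and the lemma agree on a concrete case, the following sketch computes a conditional expectation in both ways. The model (one roll of a fair die, conditioning on an even score) and all names are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical model: one roll of a fair die, X = the score, A = "the score is even".
omega = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in omega}
A = {2, 4, 6}


def X(w):
    return w


p_A = sum(prob[w] for w in A)

# Definition 11.6: E[X | A] = E[1_A X] / P(A).
by_definition = sum(X(w) * prob[w] for w in omega if w in A) / p_A

# Lemma 11.7: expected value of X under the conditional probabilities P(. | A).
cond_prob = {w: (prob[w] / p_A if w in A else Fraction(0)) for w in omega}
by_lemma = sum(X(w) * cond_prob[w] for w in omega)

print(by_definition, by_lemma)
```

The second computation treats the conditional probabilities as a probability distribution in their own right, which is exactly the point of the lemma.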

Conditional probabilities may be simpler to use than the original proba- 
bility distribution, since they permit us to break up a calculation into cases. 
In particular, we have the Law of Total Expectation, which generalizes the 
Law of Total Probability (Theorem 4.3).


Theorem 11.8 (Law of Total Expectation). Let $D_1, \ldots, D_k$ be disjoint
events with union D, and let X be a random variable such that E[X] exists.
Then

\[ E[1_D X] = \sum_{i=1}^{k} P(D_i)\, E[X \mid D_i], \tag{11.25} \]

where for each $i$, if $P(D_i) = 0$ we replace $E[X \mid D_i]$ by any number we like.
If $D = \Omega$, this becomes

\[ E[X] = \sum_{i=1}^{k} P(D_i)\, E[X \mid D_i]. \tag{11.26} \]

Exercise 11.7. Let $D_1, \ldots, D_k$ be disjoint events with union D. Prove that

\[ 1_D = \sum_{i=1}^{k} 1_{D_i}. \tag{11.27} \]

This equation generalizes equation (11.9), of course.


Exercise 11.8 (Proof of the theorem). Prove Theorem 11.8.
Equation (11.22) is convenient for this purpose.


Example 11.9 (A simple model with two cases). Consider the following
very simple game.

A player tosses a fair coin. If the toss gives a head, the player wins ten
dollars, and the game ends. Otherwise, a fair die is rolled, and the player
wins the number of dollars shown on the die. Let X be the amount that the
player wins. We wish to find E[X].

We will actually calculate E[X] twice, and compare the two approaches.


Method 1

What should be the sample space for this calculation? A reasonable
choice is to take the set $\Omega$ to consist of a symbol H together with the numbers
1, 2, 3, 4, 5, 6.

The symbol H represents the outcome for which the coin toss results in
a head. The number $i$ represents the outcome for which the coin toss results
in a tail and the score on the die roll is $i$.

The distribution P for this sample space is obtained as follows.

By the description of the experiment, $P(H) = \frac{1}{2}$.

By the multiplied-through version of the conditional probability formula
(equation (4.2)),

\[ P(\{i\}) = P(H^c)\, P(\{i\} \mid H^c). \]

In this equation we have physical probabilities, since the abstract model is
still being defined. We know the probabilities for a fair die roll, and we know
the roll of the die is not affected by the coin toss, so

\[ P(\{i\} \mid H^c) = \frac{1}{6}. \]

Thus in our model we should define

\[ P(\{i\}) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12} \quad \text{for } i = 1, 2, 3, 4, 5, 6. \]

Using the definition of E[X] gives

\[ E[X] = 10 \cdot \frac{1}{2} + \sum_{i=1}^{6} i \cdot \frac{1}{12} = 5 + \frac{21}{12} = \frac{27}{4}. \]

Method 2
Let’s start the problem again, and apply the Law of Total Expectation,
Theorem 11.8:

\[ E[X] = P(H)\, E[X \mid H] + P(H^c)\, E[X \mid H^c]. \tag{11.28} \]

Given H, X = 10, so

\[ E[X \mid H] = 10. \]

The physical description of the situation given $H^c$ is that we are rolling a fair
die, and X is the number shown on the die. We also know by equation (11.24)
that $E[X \mid H^c]$ is equal to the expected value of X in this situation. So we
have a little self-contained problem, which we know how to solve: finding the
expected value for one roll of a fair die. Thus

\[ E[X \mid H^c] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6). \]

Since $P(H^c) = 1/2 = P(H)$, substituting in equation (11.28) gives:

\[ E[X] = \frac{1}{2} \cdot 10 + \frac{1}{2} \Big( \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) \Big). \]

Either of the two methods of calculating E[X] seems equally easy in the
present example, but we can already notice two significant benefits of using
the Law of Total Expectation.


(i) The problem is decomposed in a natural way into simpler problems, 
which are “self-contained problems”, i.e. problems which can be con- 
sidered separately. 


(ii) There is no need to define a sample space for the original problem. 


These benefits are more significant in complex problems. 
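The two methods of Example 11.9 can be mirrored in a few lines of code. This is only a sketch of the arithmetic; the variable names are chosen here for illustration.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Method 1: explicit sample space {H, 1, ..., 6} with its distribution.
prob = {"H": half}
prob.update({i: half * Fraction(1, 6) for i in range(1, 7)})
winnings = {"H": 10}
winnings.update({i: i for i in range(1, 7)})
method_1 = sum(winnings[w] * prob[w] for w in prob)

# Method 2: Law of Total Expectation with the two cases H and H^c.
die_mean = sum(Fraction(i, 6) for i in range(1, 7))   # E[X | H^c]
method_2 = half * 10 + half * die_mean                # P(H) E[X|H] + P(H^c) E[X|H^c]

print(method_1, method_2)
```

Method 2 never builds the dictionary of outcome probabilities, which reflects benefit (ii) in the list above.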


Exercise 11.9. A player has a fair die and a coin with success probabil- 
ity 1/5. In the first stage of the experiment, the player rolls the die once. 
Let k be the number obtained. The player then tosses the coin k times. Find 
the expected number of successes obtained in this experiment.

11.4 Solutions for Chapter


Solution (Exercise 11.1).


Proving equation (11.5) For any set A, the only possible values for $1_A$
are 0 and 1. Since $0^2 = 0$ and $1^2 = 1$, equation (11.5) follows.


Proving equation (11.6) From the definitions, $1_{A^c}(t) = 0$ exactly when
$1_A(t) = 1$, and $1_{A^c}(t) = 1$ exactly when $1_A(t) = 0$. This is equivalent to
saying that $1_{A^c}(t) = 1 - 1_A(t)$, so equation (11.6) holds.


Proving equation (11.7)


When $t \in A \cap B$, $1_{A \cap B}(t) = 1$, and both of the statements $1_A(t) = 1$,
$1_B(t) = 1$ hold. Hence $1_{A \cap B}(t) = \min(1_A(t), 1_B(t))$.

When $t \notin A \cap B$, certainly $1_{A \cap B}(t) = 0$ by definition. Also at least one
of the statements $t \notin A$, $t \notin B$ holds. Thus at least one of the statements
$1_A(t) = 0$, $1_B(t) = 0$ holds, so $\min(1_A(t), 1_B(t)) = 0 = 1_{A \cap B}(t)$.

This proves the first equality in equation (11.7) for all possible cases.

When $t \in \{0,1\}$ and $u \in \{0,1\}$, we see by checking cases that
$tu = \min(t, u)$. This proves the remaining equality in equation (11.7).


Proving equation (11.8) When $t \in A \cup B$, $1_{A \cup B}(t) = 1$, and at least one of
the statements $1_A(t) = 1$, $1_B(t) = 1$ holds. Hence $1_{A \cup B}(t) = \max(1_A(t), 1_B(t))$.
When $t \notin A \cup B$, certainly $1_{A \cup B}(t) = 0$ by definition. Also both of the
statements $t \notin A$, $t \notin B$ hold. Thus both of the statements $1_A(t) = 0$,
$1_B(t) = 0$ hold, so $\max(1_A(t), 1_B(t)) = 0 = 1_{A \cup B}(t)$.
This proves equation (11.8) for all possible cases.


Proving equation (11.11) Since $1_{A \cup B} = \max(1_A, 1_B)$ and $1_{A \cap B} = \min(1_A, 1_B)$,
applying equation (11.10) to the numbers $1_A(x)$ and $1_B(x)$ at each point $x$
proves equation (11.11).


Solution (Exercise 11.2).


$\Longrightarrow$: Suppose that A and B are disjoint subsets of the given set S.

Let x be a point in S.

If $x \in A$ then $x \in A \cup B$ and $x \notin B$. Thus $1_A(x) = 1$, $1_{A \cup B}(x) = 1$ and
$1_B(x) = 0$. Since 1 = 1 + 0, equation (11.9) holds at x.

Similarly equation (11.9) holds at x if $x \in B$.

The remaining case is the case that $x \in (A \cup B)^c$. In this case
$1_{A \cup B}(x) = 0 = 1_A(x) = 1_B(x)$. Since 0 = 0 + 0, equation (11.9) holds in this case also.

$\Longleftarrow$: Suppose that equation (11.9) holds. If there were a point $x \in A \cap B$,
for that point we would have $1_{A \cup B}(x) = 1$, $1_A(x) = 1$ and $1_B(x) = 1$. Since
$1 \ne 1 + 1$, equation (11.9) would not hold at x.
Since equation (11.9) does hold, we conclude that $A \cap B$ is empty.

Solution (Exercise 11.3). If $t \ne u$, then one of these two numbers is the
max, and the other is the min. Thus $t + u = \max(t, u) + \min(t, u)$.

On the other hand, if $t = u$, then both numbers are equal to the max and
both are equal to the min. Hence once again we have
$t + u = \max(t, u) + \min(t, u)$.

This proves equation (11.10).


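Solution (Exercise 11.4). Using equation (11.13),

$$1_{A \cup B \cup C} = 1_{(A \cup B) \cup C} = 1_{A \cup B} + 1_C - 1_{A \cup B}\, 1_C.$$

Applying equation (11.13) to $1_{A \cup B}$,

$$\begin{aligned}
1_{A \cup B \cup C} &= 1_A + 1_B - 1_A 1_B + 1_C - (1_A + 1_B - 1_A 1_B)\, 1_C \\
&= 1_A + 1_B + 1_C - 1_A 1_B - 1_A 1_C - 1_B 1_C + 1_A 1_B 1_C.
\end{aligned}$$

Using equation (11.7), this says that

$$1_{A \cup B \cup C} = 1_A + 1_B + 1_C - 1_{A \cap B} - 1_{A \cap C} - 1_{B \cap C} + 1_{A \cap B \cap C}.$$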


Taking expectations of both sides, and then applying equation (11.4), we
obtain

$$\begin{aligned}
P(A \cup B \cup C) = {} & P(A) + P(B) + P(C) \\
& - P(A \cap B) - P(A \cap C) - P(B \cap C) \\
& + P(A \cap B \cap C).
\end{aligned} \tag{11.29}$$

This is the generalization of the two-event formula for $P(A \cup B)$ to the case of
three events.

Notice that the plus and minus signs alternate, depending on the number 
of sets in the intersection. This pattern holds for all n. 
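
If you would like a numerical sanity check of equation (11.29), here is a minimal Python sketch. The sample space (12 equally likely outcomes) and the sets `A`, `B`, `C` are made up purely for illustration; they are not from the exercise. The sketch checks the indicator identity at every outcome, and then checks the probability formula with exact fractions.

```python
from fractions import Fraction

# Hypothetical example: 12 equally likely outcomes; A, B, C are arbitrary choices.
OMEGA = range(12)
A = {0, 1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
C = {0, 5, 7, 8, 9}


def indicator(event):
    """Return the indicator function of `event` as a Python function."""
    return lambda w: 1 if w in event else 0


def prob(event):
    """Uniform probability of a subset of OMEGA."""
    return Fraction(len(event), len(OMEGA))


one_A, one_B, one_C = indicator(A), indicator(B), indicator(C)
one_union = indicator(A | B | C)

# Check the indicator identity at every outcome w.
pointwise_ok = all(
    one_union(w)
    == one_A(w) + one_B(w) + one_C(w)
    - one_A(w) * one_B(w) - one_A(w) * one_C(w) - one_B(w) * one_C(w)
    + one_A(w) * one_B(w) * one_C(w)
    for w in OMEGA
)

# Check equation (11.29) itself, using exact arithmetic.
lhs = prob(A | B | C)
rhs = (prob(A) + prob(B) + prob(C)
       - prob(A & B) - prob(A & C) - prob(B & C)
       + prob(A & B & C))

print(pointwise_ok, lhs == rhs)
```

A check on one example is of course not a proof, but it is a quick way to catch a sign error.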


Solution (Exercise 11.5). (i) For any outcome $\omega$, we must show that

$$X(\omega) = x_1 1_{\{X = x_1\}}(\omega) + \dots + x_n 1_{\{X = x_n\}}(\omega). \tag{11.30}$$

Let $x_j$ be the value of $X(\omega)$. Then $\omega \in \{X = x_j\}$.

On the right side of equation (11.30), $1_{\{X = x_i\}}(\omega) = 0$ unless $i = j$. In
that case, $1_{\{X = x_j\}}(\omega) = 1$.

Thus the only surviving term on the right side of the equation is $x_j \cdot 1 = x_j$.
This equals the left side, so we are done.


(ii) Taking expected value of both sides of equation (11.30) gives
$$E[X] = x_1 E[1_{\{X = x_1\}}] + \dots + x_n E[1_{\{X = x_n\}}].$$
Applying equation (11.4) to each term on the right side of this equation gives
$$E[X] = x_1 P(X = x_1) + \dots + x_n P(X = x_n).$$


This is equation (10.1). 


Solution (Exercise 11.6). For any outcome $\omega$, we must show that
$$Z(\omega) = v_1 1_{D_1}(\omega) + \dots + v_n 1_{D_n}(\omega). \tag{11.31}$$


Suppose that $\omega \in D_j$ (it has to be somewhere).

On the right side of equation (11.31), $1_{D_i}(\omega) = 0$ unless $i = j$. In that
case, $1_{D_j}(\omega) = 1$.

Thus the only surviving term on the right side of the equation is $v_j \cdot 1 = v_j$.
This equals the left side, so we are done. 


Solution (Exercise 11.7). Method 1. The argument here is similar to
the solution to Exercise 11.6.
One checks directly that the equation holds by verifying that it holds
for every $\omega$.
It holds when $\omega \in D_i$ for some $i$, using disjointness, and it holds when $\omega$
is not a member of any $D_i$, since both sides of the equation are zero in that
case.


Method 2. When $n = 1$, the equation is obvious.
When $n = 2$, the equation is equivalent to equation (11.9).


To obtain the equation for general $n$, use The Old Induction Trick for
generalizing from 2 to $n$ (Exercise 2.23).


Solution (Exercise 11.8). By equation (11.22),

$$E[X \mid D_i] = \frac{E[1_{D_i} X]}{P(D_i)} \quad \text{for each } i.$$

That is,
$$P(D_i)\, E[X \mid D_i] = E[1_{D_i} X] \quad \text{for each } i.$$

Thus equation (11.25) is exactly the statement that

$$E[X 1_D] = \sum_{i=1}^{k} E[X 1_{D_i}].$$

By linearity of expectation, this will be true if

$$X 1_D = \sum_{i=1}^{k} X 1_{D_i}.$$

And this last equation holds, since equation (11.27) says that

$$1_D = \sum_{i=1}^{k} 1_{D_i}.$$


Solution (Exercise 11.9). Let $D_k$ be the event that the result of rolling
the die is the number $k$.

Let $X$ be the number of successes when tossing the coin. By equation (11.26),

$$E[X] = \sum_{k=1}^{6} P(D_k)\, E[X \mid D_k].$$

Since the die is fair, $P(D_k) = \frac{1}{6}$ for all $k$.

To find $E[X \mid D_k]$, think of a simple little self-contained experiment,
namely tossing a coin $k$ times. We know from previous work that the expectation
is $kp$, where $p$ is the success probability of the coin. Thus

$$E[X \mid D_k] = k \left(\frac{1}{5}\right).$$

Hence

$$E[X] = \frac{1}{6}\cdot\frac{1}{5} + \frac{1}{6}\cdot\frac{2}{5} + \frac{1}{6}\cdot\frac{3}{5} + \frac{1}{6}\cdot\frac{4}{5} + \frac{1}{6}\cdot\frac{5}{5} + \frac{1}{6}\cdot\frac{6}{5} = \frac{1+2+3+4+5+6}{30} = \frac{21}{30} = \frac{7}{10}.$$


Chapter 12


Independent random variables, 
first applications 


12.1 Two independent random variables 


Definition 12.1 (Independence for physical random variables). Let 
X and Y be physical random variables defined for the same experiment. We 
say that X and Y are independent if every event defined in terms of the 
values of X is independent of every event defined in terms of the values of
Y.


As usual we can express an independence statement in terms of infor- 
mation. For independent random variables, information about the observed 
value of X tells us nothing about the observed value of Y. 

Definition 12.1 is a statement about physical random variables, not mathematical
random variables. Here is a definition of independence for mathematical
random variables.


Definition 12.2 (Independence for mathematical random variables).
Let $X$ and $Y$ be real-valued random variables for some probability model.
We say that $X$ and $Y$ are independent if, for any subsets $S$ and $T$ of $\mathbb{R}$, the
events $\{X \in S\}$ and $\{Y \in T\}$ are independent.

Definition 12.2 applies to all mathematical random variables, not just
those with finite range. Probability theory uses lots of mathematical random
variables with infinite range, but in this chapter we will focus on the finite
range case.

In the special case of mathematical random variables with finite range, the
next lemma tells us that we can check independence by considering events of
the form $\{X = x\}$ and $\{Y = y\}$. This is simpler than using Definition 12.2.


Lemma 12.3 (Checking independence for finite-range random vari- 
ables). Let X and Y be finite range random variables. Then the following 
statements are equivalent. 


(i) X and Y are independent. 


(ii) For every number $x$ in the range of $X$, and every number $y$ in the range
of $Y$,
$$\{X = x\}, \{Y = y\} \text{ are independent events.} \tag{12.1}$$


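Proof. (i) $\Rightarrow$ (ii): Assume that $X$ and $Y$ are independent random variables.
Let $S = \{x\}$ and let $T = \{y\}$.

Since $\{X = x\} = \{X \in S\}$ and $\{Y = y\} = \{Y \in T\}$, Definition 12.2 gives
statement (12.1).

(ii) $\Rightarrow$ (i): Assume statement (ii) holds.

Let $G$ and $H$ be any sets of real numbers. Let $c_1, \dots, c_k$ list the distinct
numbers in the range of $X$ which are members of $G$. Let $d_1, \dots, d_\ell$ list the
distinct numbers in the range of $Y$ which are members of $H$.

For any sample point $\omega$, if $X(\omega) \in G$ then $X(\omega)$ must be equal to some
$c_i$, and if $Y(\omega) \in H$ then $Y(\omega)$ must be equal to some $d_j$. Thus

$$\{X \in G\} = \bigcup_{i=1}^{k} \{X = c_i\}, \qquad \{Y \in H\} = \bigcup_{j=1}^{\ell} \{Y = d_j\}. \tag{12.2}$$

Similarly, if $\omega \in \{X \in G\} \cap \{Y \in H\}$, then $X(\omega) = c_i$ for some $i$ and
$Y(\omega) = d_j$ for some $j$. Thus
$$\{X \in G\} \cap \{Y \in H\} = \bigcup_{i,j} \{X = c_i\} \cap \{Y = d_j\}.$$

The events in this union are disjoint, and each pair $\{X = c_i\}$, $\{Y = d_j\}$ is
independent by statement (ii). Hence

$$P(\{X \in G\} \cap \{Y \in H\}) = \sum_{i,j} P(\{X = c_i\} \cap \{Y = d_j\}) = \sum_{i,j} P(X = c_i)\, P(Y = d_j).$$

Using the distributive law, this says

$$P(\{X \in G\} \cap \{Y \in H\}) = \left(\sum_{i=1}^{k} P(X = c_i)\right) \left(\sum_{j=1}^{\ell} P(Y = d_j)\right) = P(X \in G)\, P(Y \in H).$$

Hence $\{X \in G\}$ and $\{Y \in H\}$ are independent, so statement (i) holds.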


Statement (i) of Lemma 12.3 and statement (ii) of Lemma 12.3 are logically
equivalent, for finite-range random variables. However, statement (i) of
Lemma 12.3 is convenient for applying independence to a physical situation,
while statement (ii) is convenient for showing that independence holds.


Exercise 12.1. Consider a two-step experiment, in which a fair coin is tossed 
twice. 

Let $X_i = 1$ if toss $i$ gives success, and $X_i = 0$ otherwise.

Let $Y_i$ represent a “payoff” connected with this experiment. The rule is
that $Y_i = 5$ if toss $i$ gives success, and $Y_i = -5$ otherwise.

By Lemma 12.3 we know that $X_1$ and $X_2$ are independent! Also $Y_1, Y_2$
are independent, for the same reason.

More complicated random variables may be harder to analyze. In this
problem you are asked to consider other random variables.


(i) Let $X_3 = X_1 X_2$.

Prove that $X_1, X_3$ are not independent.

(ii) Let $Y_3 = Y_1 Y_2$.

Prove that $Y_1, Y_3$ are independent.


We are concentrating on finite-range random variables at the moment.
But for future reference, here’s a criterion that saves work when checking
independence for general random variables.


Lemma 12.4 (Intervals are sufficient). Real-valued random variables
$X, Y$ are independent if for all intervals $[a,b]$, $[c,d]$, the events $\{X \in [a,b]\}$
and $\{Y \in [c,d]\}$ are independent.


The proof depends on technicalities and is omitted. 


12.2 Independent indicators 


Lemma 12.5 (Sets are independent if and only if their indicators
are independent). Let $A, B$ be events in some model. Then the indicator
functions $1_A, 1_B$ are independent if and only if $A, B$ are
independent.


Nothing surprising here if you think about information. Knowing whether
or not $A$ occurred is exactly the same as knowing the value of $1_A$, and knowing
whether or not $B$ occurred is exactly the same as knowing the value of $1_B$.
The proof is just a matter of checking that the definitions mean what you
think they mean.


Proof. Suppose that $1_A, 1_B$ are independent random variables. Write $X = 1_A$
and $Y = 1_B$.

Then for any number $x$ in the range of $X$, and any number $y$ in the range
of $Y$, $\{X = x\}$ and $\{Y = y\}$ are independent.

In particular, $\{X = 1\}, \{Y = 1\}$ are independent. That is, $A, B$ are
independent.

Conversely, suppose that $A, B$ are independent.

By Lemma 5.6, the independence of $A, B$ implies the following facts:

• $A, B$ are independent,
• $A, B^c$ are independent,
• $A^c, B$ are independent, and
• $A^c, B^c$ are independent.

These statements say that:

• $\{1_A = 1\}, \{1_B = 1\}$ are independent,
• $\{1_A = 1\}, \{1_B = 0\}$ are independent,
• $\{1_A = 0\}, \{1_B = 1\}$ are independent, and
• $\{1_A = 0\}, \{1_B = 0\}$ are independent.


We have shown that for every $x$ in the range of $1_A$, and every $y$ in the
range of $1_B$, $\{1_A = x\}, \{1_B = y\}$ are independent.
Thus by Lemma 12.3, $1_A, 1_B$ are independent.


12.3 Functions of independents 
Let’s review notations from calculus: 


Definition 12.6 (Compositions of functions). Let $f$ and $g$ be any functions.
Suppose that for any point $t$ in the domain of $g$, $g(t)$ is in the domain
of $f$. Then $f(g(t))$ is defined, and we can define a new function $h$
by $h(t) = f(g(t))$. We will often denote this function simply as $f(g)$. This
notation is only a shorthand for the function which sends $t$ to $f(g(t))$, but it
seems to convey the meaning clearly.

People sometimes refer to the function $f(g)$ using words, as “the composition
of $f$ with $g$”. However, it’s safer to write the composition symbolically,
since someone might interpret the same phrase as meaning $g(f)$.

Another notation for the composition of functions is $f \circ g$. Thus $f \circ g$ and
$f(g)$ mean the same thing, and


$$f \circ g(t) = f(g(t)). \tag{12.3}$$


We can also use this notation when more functions are involved. For
example, $f \circ g \circ h$ is the function $f(g(h))$.
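
The order of composition is easy to get backwards, so here is a small Python sketch of the notation. The helper `compose` and the sample functions `square` and `add_one` are names invented for this illustration only.

```python
def compose(f, g):
    """Return the function f(g), which sends t to f(g(t))."""
    return lambda t: f(g(t))


def square(t):
    return t * t


def add_one(t):
    return t + 1


f_of_g = compose(square, add_one)   # sends t to (t + 1)^2
g_of_f = compose(add_one, square)   # sends t to t^2 + 1

# Three functions: f(g(h)) is built by composing twice.
f_of_g_of_h = compose(square, compose(add_one, abs))   # sends t to (|t| + 1)^2

print(f_of_g(3), g_of_f(3), f_of_g_of_h(-3))
```

The two orders give different functions, which is why the symbolic notation is safer than the phrase in words.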




Let $X$ and $Y$ be real-valued physical random variables which measure
quantities for the same experiment. Suppose that $X$ and $Y$ are independent
physical random variables. Let $\varphi$ and $\theta$ be functions on $\mathbb{R}$. Since $\varphi(X)$ is
determined by $X$, any information given by $\varphi(X)$ is also information about
$X$. Similarly any information given by $\theta(Y)$ is also information about $Y$.

By assumption, information about $X$ does not change your opinion concerning
information about $Y$. So information about $\varphi(X)$ does not change
your opinion concerning information about $\theta(Y)$. Hence $\varphi(X)$ and $\theta(Y)$ must
be independent.

The mathematical version of this physical statement is expressed in much 
the same way. The proof is just a matter of using the definitions carefully, 
so it may not be a high priority for readers. The physical meaning of the 
lemma is important, of course. 


Lemma 12.7 (Functions of independents are independent). Suppose
$X$ and $Y$ are independent real-valued random variables for some probability
model, and $\varphi$ and $\theta$ are functions on $\mathbb{R}$. Then $\varphi(X)$ and $\theta(Y)$ are
independent.


Proof. We will use Definition 12.2. Let $X$ and $Y$ be independent real-valued random
variables for some probability model. Let $S$ and $T$ be subsets of $\mathbb{R}$. We must
show that the events $\{\varphi(X) \in S\}$ and $\{\theta(Y) \in T\}$ are independent.

Let

$$G = \{z : z \in \mathbb{R},\ \varphi(z) \in S\}. \tag{12.4}$$

To say that $\varphi(X(\omega)) \in S$ is logically equivalent to saying that $X(\omega) \in G$.
Thus
$$\{\varphi(X) \in S\} = \{X \in G\}. \tag{12.5}$$

Similarly, let $H = \{z : z \in \mathbb{R},\ \theta(z) \in T\}$. Then

$$\{\theta(Y) \in T\} = \{Y \in H\}. \tag{12.6}$$

Definition 12.2 tells us that $\{X \in G\}, \{Y \in H\}$ are independent events.
Thus $\{\varphi(X) \in S\}, \{\theta(Y) \in T\}$ are independent events.

Since $S, T$ were any subsets of $\mathbb{R}$, $\varphi(X), \theta(Y)$ are independent by
Definition 12.2.


The following exercise is a simple test of Lemma 12.7.


Exercise 12.2. Let X,Y be finite range real-valued random variables, and 
suppose that X,Y are independent. 

Let G = 5X and let H = 16Y. Are G,H independent? Sure they are, 
it’s physically obvious. 

But let’s check that. We could appeal to Lemma 12.7. But it seems
instructive at this stage to use a more basic criterion.

So: using condition (ii) of Lemma 12.3, and without using Lemma 12.7,
show that $G, H$ are independent.


12.4 Expectation of a product 


The next theorem extends the multiplicative property from independent 
events to independent random variables. 


Theorem 12.8 (Expectation of a product of independents). Let $X$
and $Y$ be independent random variables defined for the same probability
model. Assume that $E[X], E[Y]$ exist. Then $E[XY]$ exists, and


$$E[XY] = E[X]\, E[Y]. \tag{12.7}$$


Proof. The theorem holds for general random variables, but we will only
write down a proof for the finite-range case.

We can follow the pattern of the proof of Lemma 10.10.

Let $x_1, \dots, x_n$ be the distinct numbers in the range of $X$, and let $y_1, \dots, y_m$
be the distinct numbers in the range of $Y$.

Let $D_{ij} = \{X = x_i\} \cap \{Y = y_j\}$.

By Theorem 10.7,

$$E[XY] = \sum_{i=1}^{n} \sum_{j=1}^{m} x_i y_j\, P(D_{ij}). \tag{12.8}$$

Up to this point we have not used the assumption that $X, Y$ are independent.
This tells us that

$$P(D_{ij}) = P(X = x_i)\, P(Y = y_j).$$

Hence

$$E[XY] = \sum_{i=1}^{n} \sum_{j=1}^{m} x_i y_j\, P(X = x_i)\, P(Y = y_j). \tag{12.9}$$

Using the distributive law and the definition of expected value, we see
that

$$E[XY] = \left(\sum_{i=1}^{n} x_i P(X = x_i)\right) \left(\sum_{j=1}^{m} y_j P(Y = y_j)\right). \tag{12.10}$$

Thus $E[XY] = E[X]\, E[Y]$.
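
To see the double sum in the proof at work, here is a minimal Python sketch with exact fractions. The two distributions `dist_X` and `dist_Y` are hypothetical, chosen only so the arithmetic is easy to follow; independence is built in by defining the joint probabilities as products.

```python
from fractions import Fraction

# Hypothetical finite-range distributions {value: probability}, for illustration only.
dist_X = {-1: Fraction(1, 4), 2: Fraction(3, 4)}
dist_Y = {0: Fraction(1, 3), 3: Fraction(1, 3), 5: Fraction(1, 3)}


def expectation(dist):
    """Expected value of a finite-range random variable."""
    return sum(value * p for value, p in dist.items())


# Independence assumption: P(X = x, Y = y) = P(X = x) P(Y = y).
joint = {(x, y): px * py
         for x, px in dist_X.items()
         for y, py in dist_Y.items()}

# Double sum over all pairs (x, y), weighted by the joint probabilities.
e_xy = sum(x * y * p for (x, y), p in joint.items())
e_x = expectation(dist_X)
e_y = expectation(dist_Y)

print(e_xy, e_x * e_y)
```

If you replace `joint` by probabilities that are not products, the two printed numbers will usually differ, which is a reminder that the independence assumption is doing real work.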


Exercise 12.3. Let X and Y be independent random variables with finite 
range, such that E[X?] exists and E[Y®] exists. Prove that E[X°Y*] = 
E[X?|E[y®]. 


Independence of random variables is a powerful tool in analyzing the 
behavior of probability models. 


12.5 Independence for a sequence of random 
variables 


Just as in the case of independence for events, we can consider a (possibly 
long) sequence of random variables defined on the sample space of some 
experiment. Here’s the general definition. 


Definition 12.9 (Independent sequences of random variables). Let
$X_1, \dots, X_n$ be real-valued random variables which are defined on the same
sample space. The random variables $X_1, \dots, X_n$ are said to be independent
if the following holds.

For any subsets $S_1, \dots, S_n$ of $\mathbb{R}$,

$$P(X_1 \in S_1, \dots, X_n \in S_n) = P(X_1 \in S_1) \cdots P(X_n \in S_n). \tag{12.11}$$

Take a moment to check that this definition agrees with Definition 12.2
when $n = 2$!


When the random variables happen to have finite range, things are simpler,
much as in Lemma 12.3.


Lemma 12.10 (Checking independence for a sequence of finite-range
random variables). Let $X_1, \dots, X_n$ be finite range random variables. Then
the following statements are equivalent.


(i) $X_1, \dots, X_n$ are independent.


(ii) For every sequence of numbers $x_1, \dots, x_n$, where $x_i$ is in the range of
$X_i$,

$$P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n). \tag{12.12}$$


The properties of independent sequences of random variables are simi- 
lar to the properties of two independent random variables. Physical expe- 
rience continues to be a reliable guide, and we will cheerfully write down 
mathematical equations without proofs, based on our ideas about physical 
independence. 


Exercise 12.4 (Maximum of independent). Let $X_1, \dots, X_n$ be an
independent sequence.

Suppose that each random variable $X_i$ has a uniform distribution on
$\{1, \dots, 10\}$. That is, suppose $P(X_i = j) = 1/10$ for $j = 1, \dots, 10$.

Let $M$ be the maximum of $X_1, \dots, X_n$. Find $P(M \le 4)$.


Exercise 12.5 (Minimum of independent). In the setting of Exercise 12.4,
let $m$ be the minimum of $X_1, \dots, X_n$. Find $P(m > 4)$.


We could have fun proving independence properties based on the defini- 
tion. For example, 


$$A_1, \dots, A_n \text{ are independent} \iff 1_{A_1}, \dots, 1_{A_n} \text{ are independent}. \tag{12.13}$$


However, it seems better to just assume facts like that, and keep going. 


12.6 Random walk 


Sequences of independent random variables occur in many situations, in areas 
such as economics, physics and biology. 

In this section we present a simple example. 

A bug is moving around randomly on the integers. The bug is sitting on 
the origin initially. 

The movement is as follows. 

Every second, a fair coin is tossed. The first toss takes place at time 1, 
and every second thereafter. Each toss results in a “step” by the bug, as 
follows: 


• If the bug is on integer $k$, a successful toss (a head) makes the bug
jump instantly to $k + 1$, and


• if the bug is on integer $k$, a failure makes the bug jump instantly to
$k - 1$.


This type of mathematical motion is called “random walk”, or more specif- 
ically, “simple symmetric random walk”. (The word “simple” refers to the 
fact that the bug can only jump a distance of one unit. The word “symmet- 
ric” is used because the bug does not favor right or left.) 

Since the bug changes direction frequently, it is moving very inefficiently. 
A basic question: how far from the origin is the bug likely to be, after $n$
steps?

We can start by thinking about a more abstract model for the bug’s
motion. Let $X_i$ be a random variable that represents the result of the coin
toss at time $i$. We will represent success by 1 and failure by $-1$, so that

$$P(X_i = 1) = \frac{1}{2} \quad \text{and} \quad P(X_i = -1) = \frac{1}{2}.$$

Our physical picture tells us that in the mathematical model, $X_1, \dots, X_n$ is
an independent sequence of random variables.
Define $S_0 = 0$, and let


$$S_n = X_1 + \dots + X_n \quad \text{for each } n = 1, 2, \dots. \tag{12.14}$$


$S_0$ is the location of the bug at time 0. At time 1, the bug has just taken
a single step, and is now at $S_1 = X_1$. At time 2, the bug has taken its second
step, and is now at $S_2 = X_1 + X_2$. And so on.

There are powerful techniques for analyzing random walk. In the present
chapter we will just consider one fact, which is a consequence of
Theorem 12.8.


Lemma 12.11 (Random walk squared distance).


$$E[S_n^2] = n. \tag{12.15}$$


Proof. Note first that, from the distribution, $E[X_i] = 0$ for every step. So
(by linearity) $E[S_n] = 0$. That doesn’t help us much.

Notice that the range of $S_n$ contains quite a few points. It’s easy to see
that the largest possible value in the range is $n$ (when every step is success),
and the smallest possible value is $-n$ (when every step is failure).

A little more thought will convince you that when $n$ is even, the range
consists of all even numbers from $-n$ to $n$, and when $n$ is odd, the range
consists of all odd numbers from $-n$ to $n$.

The bottom line is that it won’t be easy to find $E[S_n^2]$ directly from the
definition of expected value.

However we can expand $S_n^2$:

$$S_n^2 = \left(\sum_{i=1}^{n} X_i\right) \left(\sum_{j=1}^{n} X_j\right) = \sum_{i,j=1}^{n} X_i X_j.$$

Thus

$$E[S_n^2] = E\left[\sum_{i,j=1}^{n} X_i X_j\right]. \tag{12.16}$$

Equation (12.16) is the usual distributive law manipulation. If it looks
strange, please write out the case $n = 3$ to see what is going on!
Additivity of expectation then gives

$$E[S_n^2] = \sum_{i,j=1}^{n} E[X_i X_j].$$

Now comes the key point. For $i \ne j$, $X_i, X_j$ are independent random variables.
In that case Theorem 12.8 tells us that

$$E[X_i X_j] = E[X_i]\, E[X_j] = 0.$$

Thus, keeping only the surviving terms on the right, we have:

$$E[S_n^2] = \sum_{i=1}^{n} E[X_i^2].$$

Much simpler! The values of $X_i$ are $-1$ or $1$. So $X_i^2 = 1$, always. Hence
$E[X_i^2] = 1$, and we have $E[S_n^2] = n$, as claimed.


Let’s pause to admire equation (12.15) for a moment. The largest possible
value for $S_n^2$ is $n^2$. When $n$ is large, the average value of $S_n^2$ is much smaller
than $n^2$. So equation (12.15) is telling us that the distribution of $S_n$ does
not put much weight near the extreme values.
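
For small $n$ we can confirm equation (12.15) by brute force. The following Python sketch lists all $2^n$ equally likely step sequences and averages $S_n$ and $S_n^2$ exactly. The function name `walk_moments` is invented for this illustration, and the enumeration is only practical for small $n$.

```python
from fractions import Fraction
from itertools import product


def walk_moments(n):
    """Return (E[S_n], E[S_n^2]) by enumerating all 2^n step sequences."""
    total, total_sq = 0, 0
    for steps in product([-1, 1], repeat=n):
        s = sum(steps)          # position of the bug after n steps
        total += s
        total_sq += s * s
    paths = 2 ** n              # each sequence has probability 1 / 2^n
    return Fraction(total, paths), Fraction(total_sq, paths)


for n in range(1, 11):
    mean, mean_sq = walk_moments(n)
    print(n, mean, mean_sq)
```

Each printed row should show mean 0 and mean square $n$, in agreement with the lemma. The proof is still what tells us this holds for every $n$.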

We would like to make that last statement more precise. 


12.7 The Markov Inequality 


To extract a little more information from equation (12.15), one can use an
inequality (you will do that in Exercise 12.6). The inequality you will use is
known as the Markov inequality. Despite its simplicity, the Markov inequality
is useful in many situations.

Figure 12.1: Sample space [0,1], X(u) = u, uniform probability

Lemma 12.12 (The Markov Inequality). Let $Y$ be a nonnegative random
variable such that $E[Y]$ exists. For any number $a$,

$$a\, P(Y \ge a) \le E[Y]. \tag{12.17}$$


See Figure 12.1.


This lemma applies to general random variables. (If Y has finite range 
then of course the assumption that E [Y] exists is automatically true.) 

The proof of the lemma given here is also general, since it only uses 
properties of expectation which always hold. 


Proof. Let $A = \{Y \ge a\}$. We claim that


$$a 1_A \le Y \tag{12.18}$$


holds everywhere on the sample space.
Indeed, consider a sample point $\omega$.

If $\omega \in A$, then by the definition of $A$ we must have $a \le Y(\omega)$. Since
$1_A(\omega) = 1$, that is exactly what equation (12.18) asserts.

On the other hand, if $\omega \notin A$, then $1_A(\omega) = 0$, and $0 \le Y(\omega)$ holds since
$Y$ is assumed to be nonnegative.

Thus equation (12.18) holds everywhere.


Since expectation is monotone,
$$E[a 1_A] \le E[Y].$$


And
$$E[a 1_A] = a E[1_A] = a P(A),$$


using linearity for the first equality and equation (11.4) for the second equality.
Combining these two facts gives $a P(Y \ge a) \le E[Y]$.
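
Here is a small Python sketch that checks the Markov inequality on one example, for several values of $a$. The distribution `dist_Y` is hypothetical, made up for this illustration; any nonnegative finite-range random variable would do.

```python
from fractions import Fraction

# Hypothetical nonnegative random variable Y, given as {value: probability}.
dist_Y = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 8), 10: Fraction(1, 8)}

e_y = sum(y * p for y, p in dist_Y.items())   # E[Y]


def tail(a):
    """Return P(Y >= a)."""
    return sum(p for y, p in dist_Y.items() if y >= a)


def markov_holds(a):
    """Check a * P(Y >= a) <= E[Y] for this particular a."""
    return a * tail(a) <= e_y


test_values = [-1, 0, 1, 2, 4, 7, 10, 20]
for a in test_values:
    print(a, a * tail(a), e_y, markov_holds(a))
```

Notice how loose the inequality can be. For values of $a$ beyond the largest value of $Y$ the left side is zero, while the right side stays at $E[Y]$.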


Figure 12.1 illustrates the Markov inequality when the sample space is
$[0,1]$ and $P$ is the uniform distribution on $[0,1]$. In Figure 12.1,
the blue area is $a P(Y \ge a)$. Here $E[Y] = \int_0^1 Y(u)\, du$, and
so $E[Y]$ is equal to the entire area under the curve.


Exercise 12.6 (Searching for Charlie). Your pet bug escaped from the 
origin at time 0, and has undoubtedly been performing simple symmetric ran- 
dom walk ever since then. 10,000 seconds have elapsed since Charlie started 
roaming. You are distressed and searching frantically. Use the Markov
inequality to estimate the probability that Charlie is at least 500 units away from
the origin.


Exercise 12.7. Let $X$ be a random variable such that $E[e^X] = 5$, and let
$\beta > 0$ be a number.
Find an upper bound estimate for $P(X \ge \beta)$. Your estimate should be
in the form:
$$P(X \ge \beta) \le \text{something}.$$


Exercise 12.8. Is Lemma 12.12 an interesting statement when $a < 0$?


Exercise 12.9. Let $X$ be a nonnegative random variable such that $E[X]$
exists, and let $a$ be any number.

Suppose someone asks you to find an upper bound for $P(X > a)$.

The Markov inequality gives you an upper bound for $P(X \ge a)$, not
$P(X > a)$. Can you use the Markov inequality to get what you need?


Exercise 12.10 (Using $E[f(X)]$). Let $X$ be a random variable. Suppose
that $f$ is a nonnegative function, which is strictly increasing on the range of
$X$.
You are given that $E[f(X)]$ exists, and $E[f(X)] = c$, for some number $c$.
Let $\beta > 0$ be a number. Find an upper bound estimate for $P(X \ge \beta)$, in
terms of $\beta$, $f$, and $c$.


12.8 Solutions for Chapter 12


Solution (Exercise 12.1).


(i) Think about information. Suppose someone tells you that $X_1 = 0$. Do
you know the value of $X_3$? Heck yes! $X_3$ is zero! So $P(X_3 = 0 \mid X_1 = 0) = 1$.

In contrast to the case that $X_1 = 0$, if someone tells you that $X_1 = 1$,
then the value of $X_3$ is just the value of $X_2$. Since $X_1, X_2$ are independent,
there are still two possible values for $X_3$ in this case, and indeed

$$P(X_3 = 0 \mid X_1 = 1) = P(X_2 = 0) = \frac{1}{2}. \tag{12.19}$$

Since $\{X_1 \ne 0\} = \{X_1 = 1\}$, this shows that

$$P(X_3 = 0 \mid X_1 = 0) \ne P(X_3 = 0 \mid X_1 \ne 0). \tag{12.20}$$

So knowledge about $X_1$ can definitely change our opinion about $X_3$.

More formally, Exercise 5.8 tells us that $\{X_1 = 0\}, \{X_3 = 0\}$ are not
independent events.

The definition of independence for random variables then tells us that
$X_1, X_3$ are not independent.


(ii) There is more than one way to write down a solution. Exercise 5.8
seemed to match our thinking in part (i) so we’ll go with that here.

Notice that the range of $Y_1$ is the set $\{5, -5\}$ and the range of $Y_3$ is the
set $\{25, -25\}$.

If $Y_1 = 5$, then $Y_3 = 5 Y_2$. Hence

$$P(Y_3 = 25 \mid Y_1 = 5) = P(Y_2 = 5 \mid Y_1 = 5) = \frac{1}{2}.$$

Similarly,

$$P(Y_3 = 25 \mid Y_1 = -5) = P(Y_2 = -5 \mid Y_1 = -5) = \frac{1}{2}.$$

Since $\{Y_1 = 5\}^c = \{Y_1 = -5\}$, Exercise 5.8 tells us that $\{Y_1 = 5\}, \{Y_3 = 25\}$
are independent.

Since also $\{Y_3 = 25\}^c = \{Y_3 = -25\}$, using Lemma 5.6 we also have that
$\{Y_1 = 5\}, \{Y_3 = -25\}$ are independent, $\{Y_1 = -5\}, \{Y_3 = 25\}$ are independent,
and $\{Y_1 = -5\}, \{Y_3 = -25\}$ are independent.

Hence, by Lemma 12.3, $Y_1, Y_3$ are independent.
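
Both parts of this solution can be double-checked by listing the four outcomes of the two tosses. The Python sketch below is only an illustration: the helper names `prob` and `independent` are made up here, and `independent` simply tests condition (ii) of Lemma 12.3 for every pair of values.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair tosses: 1 = success, 0 = failure, each outcome has probability 1/4.
OMEGA = list(product([0, 1], repeat=2))
P = {w: Fraction(1, 4) for w in OMEGA}


def prob(predicate):
    """Probability of the event {w : predicate(w) is true}."""
    return sum(P[w] for w in OMEGA if predicate(w))


def independent(U, V):
    """Test condition (ii) of Lemma 12.3 for finite-range random variables U, V."""
    u_values = {U(w) for w in OMEGA}
    v_values = {V(w) for w in OMEGA}
    return all(
        prob(lambda w: U(w) == u and V(w) == v)
        == prob(lambda w: U(w) == u) * prob(lambda w: V(w) == v)
        for u in u_values
        for v in v_values
    )


def X1(w): return w[0]
def X2(w): return w[1]
def X3(w): return X1(w) * X2(w)
def Y1(w): return 5 if w[0] == 1 else -5
def Y2(w): return 5 if w[1] == 1 else -5
def Y3(w): return Y1(w) * Y2(w)


print(independent(X1, X2), independent(X1, X3), independent(Y1, Y3))
```

The check confirms what the solution argues: `X1, X3` fail the product rule, while `Y1, Y3` satisfy it for all four pairs of values.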


Solution (Exercise 12.2). Let $x_1, \dots, x_k$ be the distinct numbers in the
range of $X$, and let $y_1, \dots, y_\ell$ be the distinct numbers in the range of $Y$.
Then $5x_1, \dots, 5x_k$ are the distinct numbers in the range of $G$, and $16y_1, \dots, 16y_\ell$
are the distinct numbers in the range of $H$.
For any $i, j$,

$$\{G = 5x_i\} = \{X = x_i\}, \quad \text{and} \quad \{H = 16y_j\} = \{Y = y_j\}.$$

Since $X, Y$ are assumed to be independent, $\{X = x_i\}, \{Y = y_j\}$ are independent
events. That is, $\{G = 5x_i\}, \{H = 16y_j\}$ are independent events. Since
this is true for every value $5x_i$ of $G$ and every value $16y_j$ of $H$, condition (ii)
of Lemma 12.3 tells us that $G, H$ are independent random variables.


Solution (Exercise (12.3). This exercise is just checking that you noticed 
Lemma [12.7] 

By that lemma, X°,Y® are independent random variables. Then we can 
use Theorem 12.8.


Solution (Exercise 12.4). Since $M$ is the maximum, to say that $M \le 4$
is the same as saying that $X_i \le 4$ holds for every $i = 1, \dots, n$. Thus

$$\{M \le 4\} = \{X_1 \le 4\} \cap \dots \cap \{X_n \le 4\}.$$

Since the events $\{X_1 \le 4\}, \dots, \{X_n \le 4\}$ are independent,

$$P(M \le 4) = P(X_1 \le 4) \cdots P(X_n \le 4) = \frac{4}{10} \cdots \frac{4}{10} = \left(\frac{4}{10}\right)^n.$$


Solution (Exercise 12.5). Since $m$ is the minimum, to say that $m > 4$ is
the same as saying that $X_i > 4$ holds for every $i = 1, \dots, n$. Thus

$$\{m > 4\} = \{X_1 > 4\} \cap \dots \cap \{X_n > 4\}.$$

Since the events $\{X_1 > 4\}, \dots, \{X_n > 4\}$ are independent,

$$P(m > 4) = P(X_1 > 4) \cdots P(X_n > 4) = \frac{6}{10} \cdots \frac{6}{10} = \left(\frac{6}{10}\right)^n.$$
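
For small $n$, the answers to Exercises 12.4 and 12.5 can be confirmed by listing all $10^n$ equally likely outcomes. The Python sketch below does this; the function name `max_min_probs` and its parameters are invented for the illustration.

```python
from fractions import Fraction
from itertools import product


def max_min_probs(n, threshold=4, top=10):
    """Return (P(M <= threshold), P(m > threshold)) by enumerating all top**n outcomes."""
    outcomes = list(product(range(1, top + 1), repeat=n))
    total = len(outcomes)
    p_max = Fraction(sum(1 for w in outcomes if max(w) <= threshold), total)
    p_min = Fraction(sum(1 for w in outcomes if min(w) > threshold), total)
    return p_max, p_min


for n in (1, 2, 3):
    p_max, p_min = max_min_probs(n)
    print(n, p_max, Fraction(4, 10) ** n, p_min, Fraction(6, 10) ** n)
```

In each printed row the enumerated probabilities should agree with the powers of 4/10 and 6/10 found above.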

Solution (Exercise 12.6). Let $S_n$ be the random variable defined by
equation (12.14). Thus $S_n$ is Charlie’s location after $n$ steps.
Let $n = 10000$.
We would like to estimate $P(|S_n| \ge 500)$.
That is the same as $P(S_n^2 \ge 250000)$.
Using the Markov Inequality,

$$250000\, P(S_n^2 \ge 250000) \le E[S_n^2] = 10000.$$

Hence
$$P(|S_n| \ge 500) \le \frac{10000}{250000} = \frac{1}{25}.$$


(With more work, one can get a much sharper estimate. But this inequal- 
ity is already interesting. ) 
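
How much sharper can the estimate be? One way to find out is to compute the probability exactly. The Python sketch below uses a fact not needed in the solution: if $K$ is the number of successes in $n$ tosses, then $S_n = 2K - n$, and the number of toss sequences with $K = k$ is the binomial coefficient `comb(n, k)`. Treat this as an illustration under that assumption, not as part of the intended solution.

```python
from fractions import Fraction
from math import comb

n, distance = 10_000, 500

# S_n = 2K - n, where K counts successes among n fair tosses.
# Count the toss sequences with |S_n| < distance, then take the complement.
inside = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) < distance)
exact_tail = 1 - Fraction(inside, 2 ** n)     # P(|S_n| >= 500), computed exactly
markov_bound = Fraction(n, distance ** 2)     # E[S_n^2] / 500^2 from the solution above

print(float(exact_tail), float(markov_bound))
```

The exact probability turns out to be far below 1/25. The Markov bound is crude, but it took one line to obtain.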


Solution (Exercise 12.7). We are told that $E[e^X] = 5$, so we know how
to apply the Markov Inequality to $e^X$.
What does $\{X \ge \beta\}$ look like in terms of $e^X$?
The exponential function is a strictly increasing function, isn’t it? So the
statement that $X \ge \beta$ is exactly equivalent to the statement that $e^X \ge e^\beta$.

So $\{X \ge \beta\} = \{e^X \ge e^\beta\}$. And, using the Markov Inequality (with $Y$
replaced by $e^X$ and $a$ replaced by $e^\beta$),

$$e^\beta P(e^X \ge e^\beta) \le E[e^X] = 5.$$

Thus
$$P(X \ge \beta) \le \frac{5}{e^\beta}.$$
That finishes the problem.


Solution (Exercise 12.8). Since $P(Y \ge a) \ge 0$, for $a < 0$ we always have
$a P(Y \ge a) \le 0$.

Since $Y$ is assumed to be nonnegative, $E[Y] \ge 0$.

So the statement that $a P(Y \ge a) \le E[Y]$ is rather obvious.


Solution (Exercise 12.9). If $X > a$ then certainly $X \ge a$. In other words,
the statement “$X(\omega) > a$” is a stronger statement than “$X(\omega) \ge a$”. Hence
$\{X > a\} \subset \{X \ge a\}$.

So we always have $P(X > a) \le P(X \ge a)$. And so the upper bound
estimate for $P(X \ge a)$ is already an upper bound estimate for $P(X > a)$.


Solution (Exercise 12.10). Applying the Markov inequality to $f(X)$, we
have
$$a\, P(f(X) \ge a) \le E[f(X)] = c. \tag{12.21}$$

Since $f$ is strictly increasing on the range of $X$, to say that $X(\omega) \ge \beta$ is
exactly equivalent to saying that $f(X(\omega)) \ge f(\beta)$. Hence

$$P(X \ge \beta) = P(f(X) \ge f(\beta)).$$

Let $a = f(\beta)$. Then equation (12.21) says that
$$f(\beta)\, P(X \ge \beta) \le c.$$

Whenever $f(\beta) > 0$, we can write this as

$$P(X \ge \beta) \le \frac{c}{f(\beta)}.$$


Chapter 13


Waiting times


13.1 Waiting for the first head, with a deadline


We have worked with mathematical random variables that have finite range. 
Now we are going to broaden our view. The present section may suggest why 
this is desirable. 

Consider tossing a coin $n$ times. Let $p$ be the success probability for the
coin, where as usual by success for a toss we mean that the toss results in a
head. Let $q = 1 - p$.

Let’s study the random variable $T_n$, which we define as the time of the
first success in the sequence of $n$ tosses, if success ever occurs. Otherwise let
$T_n = n$.

We might imagine that $n$ is our deadline, and we shut down the experiment
at time $n$ if there has been no success by that time.

Our first goal is to write down the distribution of $T_n$.

Let $A_i$ be the event that toss $i$ gives success. Since the results of the
tosses are assumed to be physically independent, the events $A_1, \dots, A_n$ are
mathematically independent in the sense of our definition of independent events.

By assumption, $P(A_i) = p$. For any $k = 1, \dots, n-1$, $T_n > k$ means that
no success occurred on tosses 1 through $k$. Thus

$$\{T_n > k\} = A_1^c \cap \dots \cap A_k^c, \tag{13.1}$$

so
$$P(T_n > k) = q^k. \tag{13.2}$$

Since $q^0 = 1$, the same equation holds for $k = 0$. (As usual, interpret $0^0$ as 1
to include the case $p = 1$ and $k = 0$.)
That gives us the value of $P(T_n > k)$. If you want $P(T_n = k)$, note that
for $1 \le k < n$,
$$\{T_n = k\} = \{T_n > k-1\} - \{T_n > k\}.$$

Thus for $1 \le k < n$,
$$P(T_n = k) = q^{k-1} - q^k = q^{k-1} p. \tag{13.3}$$

Please check that we can obtain the same probability by noting that
$\{T_n = k\} = A_1^c \cap \dots \cap A_{k-1}^c \cap A_k$, and using independence!

We have found $P(T_n = k)$ for $k < n$. To get $P(T_n = n)$, we go back to
the description of the experiment.

Remember that we shut down the experiment by time $n$, whether or not
success has been achieved.

Thus $\{T_n = n\} = \{T_n > n-1\}$, so

$$P(T_n = n) = q^{n-1}. \tag{13.4}$$

Combining these facts gives:

Lemma 13.1 (Distribution of $T_n$).
$$P(T_n = k) = q^{k-1} p \ \text{ for } 1 \le k < n, \qquad P(T_n = n) = q^{n-1}, \tag{13.5}$$

$P(T_n = k) = 0$ otherwise.


Exercise 13.1. As a check, please verify in some way that the values we
have found for $P(T_n = 1), \dots, P(T_n = n)$ actually add up to one.


Let’s start work on $E[T_n]$.


Using Definition 10.2,
$$E[T_n] = \sum_{k=1}^{n-1} k q^{k-1} p + n q^{n-1}. \tag{13.6}$$


If $p = 0$, this formula gives $E[T_n] = n$, which is obviously correct, since a
head will never occur.
From now on assume that $p > 0$. We need a trick to evaluate the sum in
the formula for $E[T_n]$.
First, recall how we find the sum of a finite geometric series. Let
$s_n = 1 + q + q^2 + \dots + q^n$, where for a moment we allow $q$ to stand for any
number. The trick is to multiply by $q$.
This gives $q s_n = q + q^2 + \dots + q^{n+1} = s_n - 1 + q^{n+1}$. Solving for $s_n$ when
$q \ne 1$ gives the familiar formula for the sum of a finite geometric series:
$$s_n = 1 + q + q^2 + \dots + q^n = \frac{1 - q^{n+1}}{1 - q}. \tag{13.7}$$
Differentiate both sides of equation (13.7) with respect to $q$. This gives

$$\sum_{k=1}^{n} k q^{k-1} = \frac{1 - q^{n+1}}{(1-q)^2} - \frac{(n+1) q^n}{1 - q} = \frac{1 - q^{n+1}}{p^2} - \frac{(n+1) q^n}{p}. \tag{13.8}$$

Replacing $n$ by $n-1$ in equation (13.8), and substituting in equation (13.6),
$$E[T_n] = \frac{1 - q^n}{p} - n q^{n-1} + n q^{n-1},$$
so
$$E[T_n] = \frac{1 - q^n}{p}. \tag{13.9}$$
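
A formula obtained by a differentiation trick deserves a sanity check. The Python sketch below compares the sum in equation (13.6) with the closed form in equation (13.9), using exact fractions. The function names and the value $p = 1/5$ are chosen only for this illustration.

```python
from fractions import Fraction


def expected_Tn_direct(n, p):
    """E[T_n] computed term by term from equation (13.6)."""
    q = 1 - p
    return sum(k * q ** (k - 1) * p for k in range(1, n)) + n * q ** (n - 1)


def expected_Tn_closed(n, p):
    """E[T_n] from the closed form in equation (13.9); requires p > 0."""
    q = 1 - p
    return (1 - q ** n) / p


p = Fraction(1, 5)   # hypothetical success probability
for n in (1, 2, 5, 20):
    print(n, expected_Tn_direct(n, p), expected_Tn_closed(n, p))
```

For $n = 2$ both expressions reduce to $2 - p$, which you can also see directly: the first toss always happens, and the second toss happens with probability $q$.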


Exercise 13.2. Derive equation (13.9) in a different way, without using the
differentiation trick.
First find $E[T_{n+1}] - E[T_n]$.


Since $q < 1$, $\lim_{n \to \infty} q^n = 0$. Hence for large $n$,
$$E[T_n] \approx 1/p, \tag{13.10}$$


which is a tidier expression for $E[T_n]$, although now it is only an approximation.
We note that $1/p$ grows larger as $p$ becomes smaller, which is completely
reasonable, since it is harder to obtain a head when $p$ is smaller, and thus it
should take longer.

Equation (13.10) approximates one number by another. Can we think of
this approximation as arising from a new model?


13.2 Time of first success in ∞ trials


The simplicity of equation (13.10) suggests that we might gain a bit of elegance
by replacing a probability model with large $n$ with a probability model
in which $n = \infty$, that is, a model in which the coin tossing goes on forever.
(By accepting a more complex concept we obtain a simpler calculation. Conceptual
thinking is our human strength, so this seems like a good strategy in
general.)

We will not try to use a sample space to build a rigorous mathematical 
model for infinitely many coin tosses. This is possible, and is routine in ad- 
vanced courses, but it requires significant technicalities. Here we will simply 
use the rules of probability theory to calculate physically relevant numbers. 

We are studying how long it takes to obtain the first head. For that 
purpose, thinking about an infinite number of tosses seems like a reasonable 
idealization. After all, in the physical picture there is not a natural limit on 
the number of tosses. And the time of first success does not depend at all on 
what happens in coin tosses after the first success. 

Let T denote the time of the first head, in an infinite sequence of coin 
tosses, if a head is ever obtained. The mathematical random variable T is 
often simply referred to as the waiting time for the first success.

Of course, if a success is never obtained, we need a way to record that 
fact. So: 


By definition, if success is never obtained, $T = \infty$. (13.11)


Notice that if $p = 0$ we will never obtain a head, so the probability that
$T = \infty$ is one, and there is really nothing else to say about this situation.
From now on assume $p > 0$. Let $q = 1 - p$.
We showed that $P(T_n > k) = q^k$, for $k = 0, \dots, n-1$. The same argument
shows that here we have
\[ P(T > k) = q^k \tag{13.12} \]
for every $k = 0, 1, \dots$. And $P(T = k) = P(T > k-1) - P(T > k)$. Thus

Lemma 13.2 (Distribution of $T$).
\[ P(T = k) = q^{k-1} p \quad \text{for } 1 \le k < \infty, \tag{13.13} \]
and $P(T = k) = 0$ otherwise.

Notice that $\{T = \infty\} \subset \{T > k\}$ for every $k$. Thus $P(T = \infty) \le q^k$ for
every $k$. Since we are restricting attention now to the case that $q < 1$, $q^k \to 0$
as $k \to \infty$. So
\[ P(T = \infty) = 0 \quad \text{when } p > 0. \tag{13.14} \]


Definition 13.3 (The geometric distribution). The distribution of T 
given by equation (13.13) is usually called the geometric distribution, with 
parameter p. 


The waiting time $T$ has no direct physical meaning, since no real experiment
goes on forever. A mathematical random variable with a direct
physical interpretation is $T_n$, and $T$ is one step further away from the physical
world. However, part of the usefulness of the mathematical model for
infinitely many tosses is that we can almost picture it. And so we can still 
be guided by reality as we use it. 

We notice that $P(T > k) \to 0$ rapidly ("exponentially fast" or "geometrically
fast") as $k \to \infty$. This suggests that calculations using $T$ should give
us good approximations to the results of calculations with $T_n$.
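
The relation between equations (13.12) and (13.13) can be checked with exact rational arithmetic. The sketch below computes $P(T > k)$ as $1 - P(T \le k)$ from finitely many values of the probability mass function (this uses $P(T = \infty) = 0$) and compares it with $q^k$. The helper names and the choice $p = 1/4$ are illustrative assumptions.

```python
from fractions import Fraction


def pmf(k: int, p: Fraction) -> Fraction:
    """P(T = k) as in equation (13.13); zero for k < 1."""
    return (1 - p) ** (k - 1) * p if k >= 1 else Fraction(0)


def tail(k: int, p: Fraction) -> Fraction:
    """P(T > k) = 1 - P(T <= k), valid because P(T = infinity) = 0."""
    return 1 - sum(pmf(j, p) for j in range(1, k + 1))


p = Fraction(1, 4)  # illustrative success probability
for k in range(6):
    # The two columns should be identical fractions: P(T > k) and q^k.
    print(k, tail(k, p), (1 - p) ** k)
```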


Exercise 13.3 (The memoryless property of the geometric distribu- 
tion). Someone is tossing a coin repeatedly, and waiting for the first success. 
The success probability is p, where 0 < p< 1. Let T be the number of tosses 
needed to obtain the first success. Let P be the distribution based on the 
knowledge that the tosser has, at the time when the sequence of tosses starts. 
Then P(T = n) is given by the geometric distribution with parameter p. 

Now consider the viewpoint of a spectator who comes upon the tosser 
after n tosses have been made. The spectator learns that up to this time no 
success has been obtained. 

The spectator decides to wait until the first success. Thus the spectator 
will wait for T —n additional tosses. Based on the knowledge that the tosser 
(and the spectator) have at that moment, the probability that T —n = m is 
given by 

\[ P(T - n = m \mid T > n) = P(T = n + m \mid T > n). \]


Calculate this conditional distribution for $T - n$, and show that
\[ P(T > n + m \mid T > n) = P(T > m). \tag{13.15} \]
This is called the memoryless property of the geometric distribution. It shows
that knowing how long you have already waited for success is not helpful in
estimating the additional time that you will have to wait.


What about the expectation of $T$? We have not defined expected values
for random variables which do not have finite range. But we can easily guess
the right definition. Simply replace the usual sum by an infinite series. Then
\[ E[T] = \sum_{k=0}^{\infty} k P(T = k). \tag{13.16} \]
The term with $k = 0$ contributes nothing, of course. So
\[ E[T] = \sum_{k=1}^{\infty} k P(T = k) = \sum_{k=1}^{\infty} k q^{k-1} p. \tag{13.17} \]
By definition,
\[ \sum_{k=1}^{\infty} k q^{k-1} p = \lim_{n \to \infty} \sum_{k=1}^{n} k q^{k-1} p. \]
Using equation (13.8),
\[ \sum_{k=1}^{\infty} k q^{k-1} p = \lim_{n \to \infty} \left( \frac{1 - q^{n+1}}{p} - (n+1) q^n \right). \]
We have assumed in this discussion that $p > 0$, so $q < 1$. Then $q^n \to 0$ more
rapidly than $n \to \infty$, so we have both $(n+1) q^n \to 0$ and $q^{n+1} \to 0$. (One
can use a trick based on the Ratio Test to prove these statements.) Thus
\[ \lim_{n \to \infty} \left( \frac{1 - q^{n+1}}{p} - (n+1) q^n \right) = \frac{1}{p}. \]
We have shown:

Lemma 13.4 (Expectation of a waiting time). Let $T$ have the geometric
distribution, given in equation (13.13), with $p > 0$. Then
\[ E[T] = \frac{1}{p}. \tag{13.18} \]

Letting $n \to \infty$ in equation (13.9) gives
\[ \lim_{n \to \infty} E[T_n] = E[T]. \tag{13.19} \]
This limiting agreement increases our confidence that $T$ is a useful approximation
to $T_n$.


Remark 13.5 (The geometric series). The sum of a geometric series
is a standard calculus fact:
\[ \sum_{k=0}^{\infty} x^k = \frac{1}{1 - x}, \quad \text{for all } x \in (-1, 1). \tag{13.20} \]
Remember that $x^0$ means 1 in this series, even for $x = 0$. In other words,
the series is $1 + x + x^2 + x^3 + \dots$.

To recall why equation (13.20) holds, remember first that the Ratio
Test shows the series in equation (13.20) converges, for $|x| < 1$. Let $s$ be the
sum of this series. The same manipulation used for equation (13.7) applies
here:
\[ x s = s - 1, \quad \text{and hence} \quad s = \frac{1}{1 - x}. \]

We'll use equation (13.20) in the next examples.


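Example 13.6. Here's an alternative derivation of equation (13.18). First,
note that
\[ E[T] = \sum_{k=1}^{\infty} k q^{k-1} p = p \left( 1 + 2q + 3q^2 + \dots \right). \]
Then:
\[
\begin{aligned}
1 + 2q + 3q^2 + \dots = {} & 1 + q + q^2 + q^3 + \dots \\
& + 0 + q + q^2 + q^3 + \dots \\
& + 0 + 0 + q^2 + q^3 + \dots \\
& + 0 + 0 + 0 + q^3 + \dots \\
& \;\vdots
\end{aligned}
\tag{13.21}
\]
Adding the columns in equation (13.21) shows why the equation holds. This
is not a rigorous argument, but it is convincing.

Hence, summing each row as a geometric series,
\[ 1 + 2q + 3q^2 + \dots = \frac{1}{1 - q} + \frac{q}{1 - q} + \frac{q^2}{1 - q} + \dots \]
Thus
\[ 1 + 2q + 3q^2 + \dots = \left( 1 + q + q^2 + q^3 + \dots \right) \left( \frac{1}{1 - q} \right) = \frac{1}{1 - q} \cdot \frac{1}{1 - q}. \]
So
\[ E[T] = \frac{p}{(1 - q)^2} = \frac{1}{p}. \]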

Exercise 13.4 (One more derivation). Suppose you remember the formula
in equation (13.20) from your good old calculus days. You also remember
from calculus that a convergent power series in $x$ can be differentiated
term by term inside its interval of convergence. The same differentiation trick
that gave us equation (13.8) can be applied directly, to show:
\[ \sum_{k=1}^{\infty} k x^{k-1} = \frac{1}{(1 - x)^2} \tag{13.22} \]
for $x \in (-1, 1)$.
Check this, please. Then use equation (13.22) to obtain equation (13.18).
13.3 Solutions for Chapter 13

Solution (Exercise 13.1). In this problem we are asked to show that
\[ P(T_n = 1) + \dots + P(T_n = n) = 1, \]
with the understanding that in this equation we do not assume that we know
that there really is a probability distribution $P$. In this problem when we
write down the expression $P(T_n = k)$, that is just shorthand for the values
given in equation (13.5).

Thus, for our work in this problem, for $1 \le k < n$, $P(T_n = k)$ simply
means $q^{k-1} p$, and $P(T_n = n)$ means $q^{n-1}$.

Let
\[ f(n) = P(T_n = 1) + \dots + P(T_n = n). \tag{13.23} \]
We have to check that $f(n) = 1$ for all $n$. We have introduced the notation
$f(n)$ to make it easy to explain the argument.

By definition, $f(1) = P(T_1 = 1)$, and by definition, $P(T_1 = 1) = q^0$, so
at least for $n = 1$ we are okay:
\[ f(1) = 1. \]
You know what we are going to do next, don't you? We will relate $f(n + 1)$
to $f(n)$.

Notice that for $1 \le k < n$, $P(T_n = k) = q^{k-1} p$, and this expression
doesn't involve $n$. So for $1 \le k < n - 1$, $P(T_{n-1} = k) = q^{k-1} p$ too. The two
terms for the same $k$ are equal.

So, if we write out $f(n) - f(n-1)$ as a difference of sums, most of the
terms are going to cancel out. For $1 \le k < n - 1$, every term $P(T_n = k)$ will
cancel with the corresponding term $P(T_{n-1} = k)$. Looking at the surviving
terms,
\[
\begin{aligned}
f(n) - f(n-1) &= P(T_n = n-1) + P(T_n = n) - P(T_{n-1} = n-1) \\
&= q^{n-2} p + q^{n-1} - q^{n-2} = q^{n-2}(p + q) - q^{n-2} = 0.
\end{aligned}
\tag{13.24}
\]
The final equality holds since $q^{n-2}(p + q) = q^{n-2}(1) = q^{n-2}$.
Thus $f(n) = f(n-1)$ for every $n \ge 2$, and since $f(1) = 1$, it follows that $f(n) = 1$ for all $n$.


Solution (Exercise 13.2). From the definitions, $T_1 = 1$.
Notice that by the definition of $T_n$, if success occurs by time $n$ or earlier,
then $T_{n+1} = T_n$.
If success does not occur by time $n$, then necessarily $T_{n+1} = n + 1$ and
$T_n = n$.
Thus $T_{n+1} - T_n$ is either 1 or 0, and $P(T_{n+1} - T_n = 1)$ is the probability
of no success by time $n$, i.e. $q^n$.
Hence
\[ E[T_{n+1}] - E[T_n] = E[T_{n+1} - T_n] = q^n. \tag{13.25} \]
Since
\[ E[T_n] - E[T_1] = \left( E[T_2] - E[T_1] \right) + \dots + \left( E[T_n] - E[T_{n-1}] \right), \]
we have
\[ E[T_n] - E[T_1] = q + q^2 + \dots + q^{n-1}. \]
Since obviously $E[T_1] = 1$,
\[ E[T_n] = 1 + q + \dots + q^{n-1}. \]
By the formula for the sum of a finite geometric series, this agrees with
equation (13.9).

One could also use induction and equation (13.25) to verify equation (13.9),
of course.


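Solution (Exercise 13.3). By the conditional probability formula and
equation (13.12),
\[
P(T > n + m \mid T > n) = \frac{P(\{T > n + m\} \cap \{T > n\})}{P(T > n)}
= \frac{P(T > n + m)}{P(T > n)} = \frac{q^{n+m}}{q^n} = q^m,
\]
and $q^m = P(T > m)$ by equation (13.12) again.

The second equality holds because $\{T > n + m\} \subset \{T > n\}$, and so
$\{T > n + m\} \cap \{T > n\} = \{T > n + m\}$.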
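
The memoryless property (13.15) can also be observed experimentally. The following is a minimal simulation sketch, not part of the original text: it tosses a simulated coin until the first success, keeps only the runs with $T > n$, and estimates $P(T > n + m \mid T > n)$. The function names, the sample size, the seed, and the values of $p$, $n$, $m$ are all illustrative assumptions.

```python
import random


def first_success_time(p: float, rng: random.Random) -> int:
    """Toss a coin with success probability p until the first success."""
    tosses = 1
    while rng.random() >= p:  # a failure, which has probability q = 1 - p
        tosses += 1
    return tosses


def conditional_tail(p: float, n: int, m: int,
                     samples: int = 200_000, seed: int = 0) -> float:
    """Estimate P(T > n + m | T > n) from simulated waiting times."""
    rng = random.Random(seed)
    times = [first_success_time(p, rng) for _ in range(samples)]
    survivors = [t for t in times if t > n]  # runs with no success by time n
    return sum(t > n + m for t in survivors) / len(survivors)


p, n, m = 0.3, 3, 2
estimate = conditional_tail(p, n, m)
exact = (1 - p) ** m  # P(T > m) = q^m, by equation (13.12)
print(estimate, exact)
```

The estimate should be close to $q^m$ regardless of the value of $n$, which is the content of the memoryless property.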


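Solution (Exercise 13.4). Differentiating term by term,
\[ \sum_{k=0}^{\infty} \frac{d}{dx} x^k = \sum_{k=1}^{\infty} k x^{k-1}, \]
because $x^0$ is constant. Thus
\[ \frac{d}{dx} \sum_{k=0}^{\infty} x^k = \sum_{k=1}^{\infty} k x^{k-1}. \]
By equation (13.20),
\[ \sum_{k=0}^{\infty} x^k = \frac{1}{1 - x}, \]
so
\[ \frac{d}{dx} \sum_{k=0}^{\infty} x^k = \frac{d}{dx} \left( \frac{1}{1 - x} \right) = \frac{1}{(1 - x)^2}, \]
proving equation (13.22).

Now let's find $E[T]$. Let $q = 1 - p$. By equation (13.13), $P(T = k) =
q^{k-1} p$, for $1 \le k < \infty$, and $P(T = k) = 0$ otherwise.

Thus
\[ E[T] = \sum_{k=1}^{\infty} k P(T = k) = \sum_{k=1}^{\infty} k q^{k-1} p = p \sum_{k=1}^{\infty} k q^{k-1} = \frac{p}{(1 - q)^2} = \frac{p}{p^2} = \frac{1}{p}. \]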
Chapter 14


Random variables with 
countable range 


14.1 Countable range 


The waiting time $T$ defined in Section 13.2 of Chapter 13 is a good example
of a random variable which does not have finite range. Now we should discuss
the general properties of random variables which are somewhat similar to $T$.

Readers can probably guess many of the details, but please give some 
time to this. Developing a feeling for the theoretical concepts will pay off in 
your later work in probability. 


14.2 Countability 


Definition 14.1 (Countable sets). A set is said to be “countable” if its 
elements can be listed in a finite or infinite sequence. A set which is not 
countable is said to be uncountable. 

It should be emphasized that, by definition, a finite set is a countable set. 


Remark 14.2 (Sizes of infinity). After reading Definition 14.1 it is natural
to wonder if there even is such a thing as an uncountable set, since
an infinite sequence seems to be the natural way to describe an infinite set!
However, it was proved by Georg Cantor (1874) that the real line is uncountable,
in the sense of Definition 14.1. In other words, given any sequence of
real numbers, there must always be real numbers which are left out, and
do not appear in the sequence ([8]). Cantor's remarkable discovery showed
that, from the standpoint of mathematics, there are indeed different sizes
of infinity. Of course, out in the real world, life continued much as usual,
despite this disturbing news.


Recall the definition of a bounded function (Definition 10.18). The waiting
time $T$ defined in Section 13.2 is definitely not a bounded function. However,
we should be aware that a random variable which has infinite range can
still be bounded. The random variable $1/T$ is an example.


14.3 Countable additivity 


The theoretical properties of mathematical probability theory are simpler if 
we add two technical assumptions about mathematical probability models. 
Fortunately, these technical assumptions hold for any model that is com- 
monly used. 


Probability Assumption 14.1 (Union of an infinite sequence of abstract
events). For any probability model that we use, whenever $A_1, A_2, \dots$
is a sequence of abstract events, the union of these sets is also an abstract
event.


When we consider probabilities for an infinite sequence of events, one 
more assumption will be made: 


Probability Assumption 14.2 (Additivity for an infinite sequence
of abstract events). For any probability model that we use, whenever
$A_1, A_2, \dots$ is a sequence of disjoint events with union $A$,
\[ P(A) = \sum_{i=1}^{\infty} P(A_i). \tag{14.1} \]
The property described in equation (14.1) is usually referred to as countable
additivity.

From now on, Assumption 14.1 and Assumption 14.2 will hold, even if
we don't mention them.

An infinite sequence of abstract events has no direct physical meaning. 
Similarly the action of summing an infinite series of probabilities has no direct 
physical meaning. Thus Assumption 14.1 and Assumption 14.2 do not seem
to contribute any physical insight to our probability models, despite their
technical usefulness. So it is interesting that in practice, for any experiment
we can always choose a valid probability model such that Assumptions 14.1
and 14.2 hold.

Where are Assumption 14.1 and Assumption 14.2 going to be used?
Sometimes we add up an infinite sequence of probabilities, in the process 
of calculating a physically meaningful probability value. But our assump- 
tions are also used behind the scenes, to guarantee that abstract events, 
probabilities, and expectations exist and have convenient properties. 

Since we now assume countable additivity, we could generalize some ear- 
lier statements. Typically the generalization amounts to simply replacing a 
finite sum by the sum of an infinite series. We usually don’t bother to state 
generalizations of this sort, but simply use them when and if they are needed. 

Incidentally, countable additivity often holds in mathematical models for 
quantities other than probability, for example for quantities that represent 
physical properties such as length, area, weight and displacement. Here’s an 
example. 


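Example 14.3 (Chopping up the unit interval). Let $A_i = (1/2^{i+1}, 1/2^i]$
for $i = 0, 1, \dots$.

It is easy to see that the sets $A_i$ are disjoint, that the union of all the sets
$A_i$ is exactly equal to $(0, 1]$, and that the length of $A_i$ is $1/2^{i+1}$.

Then
\[ \sum_{i=0}^{\infty} \operatorname{length}(A_i) = \sum_{i=0}^{\infty} \left( \frac{1}{2} \right)^{i+1} = \frac{1}{2} \sum_{i=0}^{\infty} \left( \frac{1}{2} \right)^{i} = \frac{1}{2} \cdot \frac{1}{1 - \frac{1}{2}} = 1. \]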


Thus the sum of the lengths of the intervals A; is equal to the length of 
the union of these intervals, verifying a particular case of countable additivity 
for length. 
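
The partial sums in Example 14.3 can be computed exactly. The sketch below, with illustrative helper names, uses rational arithmetic to show that the total length of $A_0, \dots, A_{n-1}$ is $1 - 1/2^n$, which tends to 1, the length of $(0, 1]$.

```python
from fractions import Fraction


def interval_length(i: int) -> Fraction:
    """Length of A_i = (1/2^(i+1), 1/2^i]."""
    return Fraction(1, 2 ** i) - Fraction(1, 2 ** (i + 1))


def partial_sum(n: int) -> Fraction:
    """Total length of A_0, ..., A_(n-1)."""
    return sum((interval_length(i) for i in range(n)), Fraction(0))


for n in (1, 2, 5, 10, 20):
    # The gap to 1 is exactly 1/2^n, so the partial sums converge to 1.
    print(n, partial_sum(n), 1 - partial_sum(n))
```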
Example 14.4 (More general chopping). Let $b_0 < b$ be real numbers,
and let $b_i < b$ be an increasing sequence of real numbers such that $b_i \to b$
(one often writes $b_i \nearrow b$ to indicate an increasing limit).

You can easily convince yourself that the intervals $[b_i, b_{i+1})$ are disjoint,
and that
\[ [b_0, b) = \bigcup_{i=0}^{\infty} [b_i, b_{i+1}). \]
Clearly
\[ \operatorname{length}([b_i, b_{i+1})) = b_{i+1} - b_i. \]
Notice that
\[ \sum_{i=0}^{n-1} \operatorname{length}([b_i, b_{i+1})) = \sum_{i=0}^{n-1} (b_{i+1} - b_i) = (b_1 - b_0) + (b_2 - b_1) + \dots + (b_n - b_{n-1}) = b_n - b_0. \]
In other words, the sum telescopes.
Then
\[
\sum_{i=0}^{\infty} \operatorname{length}([b_i, b_{i+1})) = \lim_{n \to \infty} \sum_{i=0}^{n-1} \operatorname{length}([b_i, b_{i+1}))
= \lim_{n \to \infty} (b_n - b_0) = b - b_0 = \operatorname{length}([b_0, b)).
\]
Thus the sum of the lengths of the intervals $[b_i, b_{i+1})$ is equal to the length
of the union of these intervals, verifying another particular case of countable
additivity for length.


Exercise 14.1. We have not bothered to describe a sample space on which
the mathematical waiting time $T$ in Chapter 13 is defined. But suppose that
$T$ and the random variables $T_n$ of Chapter 13 are all defined on the same
sample space. Then
\[ \{T_n = n\} = \{T = n\} \cup \{T = n+1\} \cup \{T = n+2\} \cup \dots \cup \{T = \infty\}. \]
Using countable additivity, calculate $P(T_n = n)$ using the sum of the series
on the right side of this equation. (As usual, we assume that $p > 0$.) Check
that the result agrees with equation (13.1).


Exercise 14.2. Let $T$ be the mathematical waiting time in Chapter 13.
Find the probability that $T$ is even, assuming countable additivity and
summing a series.


Example 14.5 (Even and odd). In the situation of Exercise 14.2, Lemma 13.2
tells us that
\[ P(T = k) = q^{k-1} p \quad \text{for } 1 \le k < \infty, \tag{14.2} \]
and $P(T = k) = 0$ otherwise. You can use equation (14.2) to solve Exercise 14.2.
But let's try to find the probability that $T$ is even, without using
equation (14.2) explicitly.

We have
\[ P(T = k + 1) = P(T > k)\, p, \tag{14.3} \]
using independence. Also
\[ P(T > k + 1) = P(T > k)\, q, \tag{14.4} \]
using independence.
Thus
\[ P(T = k + 2) = p P(T > k + 1) = p q P(T > k) = q P(T = k + 1). \]
By countable additivity,
\[ P(T \text{ even}) = P(T = 2) + P(T = 4) + \dots, \]
while
\[
\begin{aligned}
P(T \text{ odd}) &= P(T = 1) + P(T = 3) + P(T = 5) + \dots \\
&= p + q P(T = 2) + q P(T = 4) + \dots \\
&= p + q P(T \text{ even}).
\end{aligned}
\]
Since $P(T \text{ odd}) + P(T \text{ even}) = 1$, we have $p + (1 + q) P(T \text{ even}) = 1$, hence
\[ P(T \text{ even}) = \frac{q}{1 + q}. \tag{14.5} \]
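
Equation (14.5) can be checked numerically by summing the series $P(T = 2) + P(T = 4) + \dots$ with the values from equation (14.2). This is a small sketch; the function name and the truncation at 500 terms are illustrative assumptions (the neglected tail is far below floating-point precision for the values of $p$ used here).

```python
def prob_even(p: float, terms: int = 500) -> float:
    """Partial sum of P(T = 2l) = q^(2l-1) p over l = 1, ..., terms."""
    q = 1.0 - p
    return sum(q ** (2 * l - 1) * p for l in range(1, terms + 1))


for p in (0.1, 0.5, 0.9):
    q = 1.0 - p
    # The two values should agree up to rounding: series sum and q / (1 + q).
    print(p, prob_even(p), q / (1 + q))
```

Note that the probability of an even waiting time is always below $1/2$, since $q/(1+q) < 1/2$ whenever $q < 1$.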
14.4 Calculus review: summing an absolutely convergent series


We often have to consider the sum of an infinite series of nonnegative numbers,
or, more generally, the sum of a series which converges absolutely. This
will be the case when adding probability values, but can also happen in other
situations. We can use some facts from calculus.


Series Property 14.1 (Convergence test). If the terms in the series are 
nonnegative, and partial sums are bounded, then the series converges. 


Series Property 14.2 (Rearrangement property). If the series con- 
verges absolutely, then the terms in the series can be rearranged in any 
order without changing the sum of the infinite series. 


Properties 14.1 and 14.2 imply another useful calculus fact:


Series Property 14.3 (Exchanging the order of summation). Let
real numbers $a_{ij}$ be given for all positive integers $i$ and $j$. Suppose that
\[ \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} |a_{ij}| < \infty. \tag{14.6} \]
Then
\[ \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} a_{ij} = \sum_{j=1}^{\infty} \sum_{i=1}^{\infty} a_{ij}, \tag{14.7} \]
meaning that both sides of the equation are convergent, and they are equal.


We won't bother to write out the proof of equation (14.7), but exchanging 
the order of summation is an important trick. 
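
To see why the absolute convergence condition (14.6) matters, here is a standard counterexample, which is not taken from the text: put $a_{ii} = 1$, $a_{i,i+1} = -1$, and $a_{ij} = 0$ otherwise. Each row and each column has only finitely many nonzero entries, so the inner sums below are exact. The function names are illustrative.

```python
def a(i: int, j: int) -> int:
    """Array entries: +1 on the diagonal, -1 just right of it, 0 elsewhere."""
    if j == i:
        return 1
    if j == i + 1:
        return -1
    return 0


def row_sum(i: int) -> int:
    """Sum over j of a(i, j); row i is nonzero only in columns i and i + 1."""
    return sum(a(i, j) for j in range(1, i + 2))


def col_sum(j: int) -> int:
    """Sum over i of a(i, j); column j is nonzero only in rows j - 1 and j."""
    return sum(a(i, j) for i in range(1, j + 1))


# Every row sums to 0, so summing rows first gives 0.
rows_first = sum(row_sum(i) for i in range(1, 51))
# Column 1 sums to 1 and every other column sums to 0, so columns first gives 1.
cols_first = sum(col_sum(j) for j in range(1, 51))
print(rows_first, cols_first)
```

The two iterated sums differ, which does not contradict Series Property 14.3: here the sum of $|a_{ij}|$ is infinite, so condition (14.6) fails.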


14.5 Distributions for random variables with 
countable range 


We stated the mathematical definition of the distribution of a general random
variable $X$ in Definition 9.6. When the random variable $X$ has a finite range,
equation (9.5) of Section 9.2 gives a simple formula for calculating $P(X \in S)$,
using the probability mass function for $X$. We can use a similar approach
when $X$ has countable range.

Let $W$ be a set of numbers, and let $x_1, x_2, \dots$ be a sequence of distinct values,
which includes all the
numbers in the range of $X$ that are members of $W$. If the value of $X$ is a
member of $W$, then that value must be equal to one of the numbers in the
sequence. Thus
\[ \{X \in W\} = \{X = x_1\} \cup \{X = x_2\} \cup \dots \tag{14.8} \]
Since the values $x_1, x_2, \dots$ are distinct, the sets $\{X = x_1\}, \{X = x_2\}, \dots$ are
disjoint. By the countable additivity of probability we have
\[ P(X \in W) = P(X = x_1) + P(X = x_2) + \dots \tag{14.9} \]
This has the same form as equation (9.5), but now the expression on the right
can be either a finite sum or an infinite series.

Using the probability mass function $q$ for $X$ (defined in Definition 9.7),
equation (14.9) can be rewritten as
\[ P(X \in W) = q(x_1) + q(x_2) + \dots \tag{14.10} \]


Thus the probability mass function for X characterizes the whole distribution 
of X. 


14.6 Expected values: countable range case 


Definition 14.6 (Expected value with countable range). Let $X$ be a
random variable whose range can be listed in a finite or infinite sequence of
distinct values $x_1, x_2, \dots$. When the sequence is finite, $E[X]$ is defined as
before, for finite-range random variables. When the sequence is infinite, the expected value $E[X]$ is
defined by
\[ E[X] = \sum_{j=1}^{\infty} x_j P(X = x_j), \tag{14.11} \]
but only in the case that the series converges absolutely, i.e. when
\[ \sum_{j=1}^{\infty} |x_j| P(X = x_j) < \infty. \tag{14.12} \]
We see from Definition 14.6 that the expected value of a random variable
is determined by its distribution.

Notice that Definition 14.6 agrees with the earlier definition in the finite-range case. That's what we
want. We are interested in extending the definition of expected value to new
situations, but we don't want to change the definition that we've already
given earlier.

In Definition 14.6 the expected value of $X$ exists if the series in equation (14.11)
converges absolutely, and only in that case.


Remark 14.7 (The meaning of existence). To clarify our terminology,
let us agree that if we say that $E[X]$ exists, we mean that $E[X]$ exists as a
real number, in the sense of Definition 14.6. Sometimes we might say "$E[X]$
exists as a real number", just to avoid any possible misunderstanding.

For a nonnegative random variable $X$, if $E[X]$ does not exist as a real
number people sometimes say that $E[X] = \infty$. That notation is helpful in
showing what is going on, but, despite that notation, in this book we will
not say that $E[X]$ exists in that case.


In calculus, the comparison principle for infinite series tells us that a 
series which is dominated by a convergent series must itself be convergent. A 
similar comparison principle holds for integrals of unbounded functions, and 
for expected values of unbounded random variables. 


Fact 14.8 (A comparison principle for unbounded random variables).
$E[X]$ exists if and only if $E[|X|]$ exists. Furthermore, if $E[X]$
exists and $|Y| \le |X|$ everywhere then $E[Y]$ exists also.


Fact 14.8 holds for general random variables, not just random variables
with countable range. In the case of an unbounded random variable with
countable range, we have defined expected value in terms of an infinite series,
so Fact 14.8 is an immediate consequence of the comparison principle for
infinite series. For general random variables, mathematical expected value is
defined in a less direct way, but the comparison principle holds in the general
case also.

14.7 Key properties


Let’s collect some facts that we know about expectations. 


Theorem 14.9 (Four key properties of expected values). Four key
properties of expected value are valid for all random variables:


Linearity: If $E[X]$ and $E[Y]$ exist then $E[X + Y]$ exists, and
\[ E[X + Y] = E[X] + E[Y]. \tag{14.13} \]
This is the additive property of expectation.
Also, if $E[X]$ exists then for any number $c$, $E[cX]$ exists, and
\[ E[cX] = c E[X]. \tag{14.14} \]
This is the scaling property of expectation.


Monotonicity: If $E[X]$ exists and $E[Y]$ exists, and if $X \le Y$ holds everywhere,
then
\[ E[X] \le E[Y]. \tag{14.15} \]


Expectation of an indicator: This is equation (11.4), which says:
\[ E[1_A] = P(A). \]


Comparison principle: This is Fact 14.8, which says:
$E[X]$ exists if and only if $E[|X|]$ exists. Furthermore, if $E[X]$ exists
and $|Y| \le |X|$ everywhere then $E[Y]$ exists also.


It is important to keep in mind that these properties hold for all random
variables, not just countable range random variables (see Theorem 15.2).
The comparison principle is only needed for unbounded random variables.


Exercise 14.3. Consider a mathematical random variable $X$ such that all
the values of $X$ are positive integers, and such that for some $c > 0$,
$P(X = n) = c/n^2$ for $n = 1, 2, \dots$.
(i) Does such a mathematical random variable actually exist?
(ii) Assuming that $X$ exists, does $E[X]$ exist?


Exercise 14.4. Let $b_j$, $j = 0, 1, 2, \dots$, be strictly increasing numbers in $[0, 1)$,
with $b_0 = 0$. Suppose that $b_j \nearrow 1$ as $j \to \infty$.

It is clear that the intervals $[b_j, b_{j+1})$ are disjoint and have union equal to
$[0, 1)$. You may use this fact in what follows. (Please sketch a picture if this
fact is not clear!)


(i) Consider a probability model with sample space $[0, 1)$ and the uniform
probability distribution. Let $X(t)$ be the length of the interval $[b_j, b_{j+1})$ which
contains $t$. Write down a formula for $E[X]$ as an infinite series of numbers.


(ii) Suppose that $b_j = 1 - \frac{1}{2^j}$, for $j = 0, 1, 2, \dots$.
Calculate the exact numerical value of $E[X]$ for the random variable in
part (i).


Exercise 14.5. Consider the probability model described in part (i) of Exercise 14.4.
Let $Y$ be the random variable defined by
\[ Y(t) = \frac{\sin(b_{j+1}) - \sin(b_j)}{b_{j+1} - b_j} \]
for every $t \in [b_j, b_{j+1})$. Calculate $E[Y]$.


Exercise 14.6. Four key properties of expectation are stated in Theorem 14.9.
Let $X$ be a constant random variable, say $X = 7$ everywhere. As a
simple consequence of the definition of expected value for finite-range random
variables, you know that $E[X] = 7$ (Lemma 10.3).
How would you use properties stated in Theorem 14.9 to show that
$E[X] = 7$?
One of the most useful theoretical facts in Chapter 10 was Theorem 10.7.
This theorem extends to random variables with countable range, with little
change.


Theorem 14.10 (Expectation by cases). Let $D_1, D_2, \dots$ be a sequence
of disjoint events in some model, whose union is the whole sample space.

Let $v_1, v_2, \dots$ be numbers, and let $X$ be a random variable such that
$X = v_i$ at every point of $D_i$.
Then
\[ E[X] = \sum_{i=1}^{\infty} v_i P(D_i), \tag{14.16} \]
in the sense that $E[X]$ exists if and only if the series on the right converges
absolutely, and in this case equality holds.


The proof is similar to the proof of Theorem 10.7, replacing finite sums
with sums of infinite series. Applying Theorem 14.10 gives the general formula
for the expectation of a function of a random variable:


Theorem 14.11 (Expectation of a function of a countable-range
random variable). Let $Z$ be a countable-range random variable on a sample
space $\Omega$. We do not assume that $Z$ is real-valued. The values of $Z$ can be
anything. Let the distinct values in the range of $Z$ be $v_1, v_2, \dots$.

Let $\varphi$ be any real-valued function whose domain includes $v_1, v_2, \dots$. Then
\[ E[\varphi(Z)] = \sum_{i=1}^{\infty} \varphi(v_i) P(Z = v_i), \tag{14.17} \]
in the sense that $E[\varphi(Z)]$ exists if and only if the series on the right converges
absolutely, and in this case equality holds.
14.8 Calculating expectation using the tail of the distribution


Definition 14.12 (The tail of a distribution). For any real-valued random
variable $X$, probabilities of the form $P(X > t)$ are sometimes called tail
probabilities, particularly when one is studying the behavior of $P(X > t)$ as
$t \to \infty$. As a function of $t$, $P(X > t)$ is referred to as the tail of the distribution.
(Similar terminology applies to $P(X < -t)$.)


If we know the tail of the distribution, then we can calculate everything 
else about the distribution, with a little work. In particular, there is a nice 
recipe for calculating expected values of nonnegative random variables. The 
next lemma gives the most common case. 


Lemma 14.13 (A tail expectation formula). Let $X$ be a nonnegative
random variable, and let $a > 0$ be such that the range of $X$ is contained in
the set of numbers $na$, $n = 0, 1, \dots$. Then
\[ E[X] = a \sum_{k=1}^{\infty} P(X \ge ka). \tag{14.18} \]
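Proof. By countable additivity, $P(X \ge ka) = \sum_{n=k}^{\infty} P(X = na)$ for each $k \ge 1$.
All the terms are nonnegative, so the order of summation can be exchanged:
\[
a \sum_{k=1}^{\infty} P(X \ge ka) = a \sum_{k=1}^{\infty} \sum_{n=k}^{\infty} P(X = na)
= a \sum_{n=1}^{\infty} \sum_{k=1}^{n} P(X = na) = \sum_{n=1}^{\infty} na\, P(X = na) = E[X].
\]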
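
Equation (14.18) is easy to test on a small example. The sketch below uses a hypothetical finite distribution on the values $0, a, 2a, 3a$ with $a = 1/2$ (these numbers are made up for illustration) and compares the direct definition of $E[X]$ with the tail formula, using exact fractions.

```python
from fractions import Fraction

a = Fraction(1, 2)
# Hypothetical distribution: pmf[n] = P(X = n * a); the values sum to 1.
pmf = {0: Fraction(1, 8), 1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8)}


def tail_at(k: int) -> Fraction:
    """P(X >= k * a)."""
    return sum(prob for n, prob in pmf.items() if n >= k)


# Direct definition: sum of value times probability.
direct = sum(n * a * prob for n, prob in pmf.items())
# Tail formula (14.18): a times the sum of P(X >= k a) over k >= 1.
via_tail = a * sum(tail_at(k) for k in range(1, max(pmf) + 1))
print(direct, via_tail)
```

Both computations give 11/16 for this distribution.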
Exercise 14.7. Use equation (14.18) to calculate $E[T]$, where $T$ is the waiting
time defined in Section 13.2.
Note that this is the same calculation used in Example 13.6.


We have not yet defined expected values for general random variables,
but expectation can be defined in general. Lemma 14.13 is a special case of
the following general theorem, which holds for every random variable.


Theorem 14.14 (Expectation using the tail integral formula). Let $X$
be a nonnegative random variable. Then
\[ E[X] = \int_0^{\infty} P(X > t)\, dt. \tag{14.19} \]
Equation (14.19) is completely general, in the sense that if either side of
this equation exists then both sides exist and are equal.


Exercise 14.8. Suppose that $X$ satisfies the assumptions of Lemma 14.13.
Show that equation (14.19) implies equation (14.18).


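14.9 Solutions for Chapter 14


Solution (Exercise 14.1). Let $A_i$ be the event that toss $i$ gives success.
Then
\[ \{T = n\} = A_1^c \cap \dots \cap A_{n-1}^c \cap A_n, \]
while
\[ \{T_n = n\} = A_1^c \cap \dots \cap A_{n-1}^c, \]
so
\[ \{T_n = n\} = \left( A_1^c \cap \dots \cap A_{n-1}^c \cap A_n \right) \cup \left( A_1^c \cap \dots \cap A_{n-1}^c \cap A_n^c \right). \]
Thus
\[ \{T_n = n\} = \{T = n\} \cup \{T > n\}. \]
Clearly
\[ \{T > n\} = \{T = n+1\} \cup \{T = n+2\} \cup \dots \cup \{T = \infty\}. \]
By independence, $P(T = k) = q^{k-1} p$. Also $P(T = \infty) = 0$. Thus
\[ P(T > n) = \left( \sum_{k=n+1}^{\infty} q^{k-1} p \right) + 0 = \sum_{k=n+1}^{\infty} q^{k-1} p = q^n p \sum_{i=0}^{\infty} q^i = \frac{q^n p}{1 - q} = q^n. \]
This gives
\[ P(T_n = n) = q^{n-1} p + q^n = q^{n-1}(p + q) = q^{n-1}. \]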
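Solution (Exercise 14.2).
\[ \{T \text{ even}\} = \{T = 2\} \cup \{T = 4\} \cup \{T = 6\} \cup \dots \]
Using countable additivity,
\[
\begin{aligned}
P(T \text{ even}) &= P(T = 2) + P(T = 4) + P(T = 6) + \dots = \sum_{\ell=1}^{\infty} P(T = 2\ell) = \sum_{\ell=1}^{\infty} q^{2\ell - 1} p \\
&= pq \sum_{\ell=0}^{\infty} (q^2)^{\ell} = \frac{pq}{1 - q^2} = \frac{pq}{(1 - q)(1 + q)} = \frac{q}{1 + q}.
\end{aligned}
\]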


Solution (Exercise 14.3). (i) A mathematical random variable has to be
defined on a sample space. So let's try $\Omega = \{1, 2, \dots\}$, and define $X(n) = n$.
Then at least this $X$ will have positive integer values.

We are supposed to have
\[ P(X = n) = \frac{c}{n^2} \]
for some constant $c$.
With our definition of $X$, $P(X = n) = P(\{n\})$. So we want to have
\[ P(\{n\}) = \frac{c}{n^2}. \]
This definition will give us a genuine distribution provided that
\[ \sum_{n=1}^{\infty} P(\{n\}) = 1, \]
i.e.
\[ \sum_{n=1}^{\infty} \frac{c}{n^2} = 1. \]
So finally we see the essential requirement: is it true that
\[ \sum_{n=1}^{\infty} \frac{1}{n^2} < \infty? \]
Well, yes, it is true, by the Integral Test from calculus. So now we just define
\[ c = \left( \sum_{n=1}^{\infty} \frac{1}{n^2} \right)^{-1}, \]
and we have a genuine distribution, such that $X$ does have the stated properties.


(ii) From our definitions, if $E[X]$ exists then
\[ E[X] = \sum_{n=1}^{\infty} n P(X = n) = c \sum_{n=1}^{\infty} \frac{1}{n}. \]
But it is a well-known calculus fact that $\sum_{n=1}^{\infty} \frac{1}{n}$ does not converge. So
$E[X]$ does not exist.
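
The contrast between the two series can be seen numerically. The sketch below assumes the standard value $\sum 1/n^2 = \pi^2/6$ (a calculus fact not stated in the text), so that $c = 6/\pi^2$. The function names are illustrative. The total probability settles near 1, while the partial sums for $E[X]$ keep growing, roughly like $c \ln N$.

```python
import math

# Assumption: sum of 1/n^2 over n >= 1 equals pi^2 / 6, hence c = 6 / pi^2.
c = 6 / math.pi ** 2


def total_probability(N: int) -> float:
    """Sum of P(X = n) = c / n^2 for n = 1, ..., N."""
    return sum(c / n ** 2 for n in range(1, N + 1))


def partial_expectation(N: int) -> float:
    """Sum of n * P(X = n) = c / n for n = 1, ..., N."""
    return sum(n * (c / n ** 2) for n in range(1, N + 1))


for N in (10, 1_000, 100_000):
    print(N, total_probability(N), partial_expectation(N))
```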


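Solution (Exercise 14.4). (i) By definition,
\[ \{X = b_{j+1} - b_j\} = [b_j, b_{j+1}). \]
Thus
\[ P(X = b_{j+1} - b_j) = b_{j+1} - b_j. \]
Every point $t$ in the sample space is a member of exactly one of the sets
$[b_j, b_{j+1})$. Hence the range of $X$ consists of the points $b_{j+1} - b_j$. By definition,
\[ E[X] = \sum_{j=0}^{\infty} (b_{j+1} - b_j) P(X = b_{j+1} - b_j) = \sum_{j=0}^{\infty} (b_{j+1} - b_j)^2. \]

(ii) In this case,
\[
E[X] = \sum_{j=0}^{\infty} \left( \frac{1}{2^j} - \frac{1}{2^{j+1}} \right)^2 = \sum_{j=0}^{\infty} \left( \frac{1}{2^{j+1}} \right)^2 = \sum_{j=0}^{\infty} \left( \frac{1}{4} \right)^{j+1}
= \frac{1}{4} \sum_{j=0}^{\infty} \left( \frac{1}{4} \right)^{j} = \frac{1}{4} \cdot \frac{1}{1 - \frac{1}{4}} = \frac{1}{3}.
\]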
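
The value $1/3$ in part (ii) can be confirmed with exact arithmetic. This sketch, with illustrative function names, sums $(b_{j+1} - b_j)^2$ for $b_j = 1 - 1/2^j$ and shows that the partial sums equal $(1 - 4^{-n})/3$.

```python
from fractions import Fraction


def b(j: int) -> Fraction:
    """Endpoints b_j = 1 - 1/2^j from part (ii) of Exercise 14.4."""
    return 1 - Fraction(1, 2 ** j)


def partial_expectation(n: int) -> Fraction:
    """Sum of (b_{j+1} - b_j)^2 for j = 0, ..., n - 1."""
    return sum(((b(j + 1) - b(j)) ** 2 for j in range(n)), Fraction(0))


for n in (1, 2, 5, 10):
    # The gap to 1/3 shrinks by a factor of 4 with each extra term.
    print(n, partial_expectation(n), Fraction(1, 3) - partial_expectation(n))
```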
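Solution (Exercise 14.5). Every point $t$ in the sample space is a member
of exactly one of the sets $[b_j, b_{j+1})$. Hence the range of $Y$ consists of the
points
\[ \frac{\sin(b_{j+1}) - \sin(b_j)}{b_{j+1} - b_j}. \]
By definition,
\[
\begin{aligned}
E[Y] &= \sum_{j=0}^{\infty} \left( \frac{\sin(b_{j+1}) - \sin(b_j)}{b_{j+1} - b_j} \right) P\left( Y = \frac{\sin(b_{j+1}) - \sin(b_j)}{b_{j+1} - b_j} \right) \\
&= \sum_{j=0}^{\infty} \left( \frac{\sin(b_{j+1}) - \sin(b_j)}{b_{j+1} - b_j} \right) (b_{j+1} - b_j) = \sum_{j=0}^{\infty} \left( \sin(b_{j+1}) - \sin(b_j) \right).
\end{aligned}
\]
The final series telescopes:
\[
\begin{aligned}
\sum_{j=0}^{n-1} \left( \sin(b_{j+1}) - \sin(b_j) \right)
&= \left( \sin(b_1) - \sin(b_0) \right) + \left( \sin(b_2) - \sin(b_1) \right) + \dots + \left( \sin(b_n) - \sin(b_{n-1}) \right) \\
&= \sin(b_n) - \sin(b_0) = \sin(b_n).
\end{aligned}
\]
Letting $n \to \infty$, we see that
\[ \sum_{j=0}^{\infty} \left( \sin(b_{j+1}) - \sin(b_j) \right) = \sin 1, \]
so $E[Y] = \sin 1$.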
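
For a concrete check of the telescoping argument, the sketch below takes the particular sequence $b_j = 1 - 1/2^j$ from Exercise 14.4 (ii); that choice, like the function names, is an assumption made only for illustration, since the solution holds for any admissible sequence. It sums value times probability for $Y$ and compares the result with $\sin 1$.

```python
import math


def b(j: int) -> float:
    """Illustrative endpoints b_j = 1 - 1/2^j, increasing to 1 with b_0 = 0."""
    return 1.0 - 0.5 ** j


def partial_expectation(n: int) -> float:
    """Sum over j < n of (value of Y on [b_j, b_{j+1})) * (length of that interval)."""
    total = 0.0
    for j in range(n):
        width = b(j + 1) - b(j)
        value = (math.sin(b(j + 1)) - math.sin(b(j))) / width
        total += value * width
    return total


for n in (1, 5, 10, 40):
    print(n, partial_expectation(n), math.sin(1.0))
```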
Solution (Exercise 14.6). From the definitions,
\[ X = 7 \cdot 1_{\Omega}. \]
By linearity and the fact about the expectation of an indicator,
\[ E[X] = 7 E[1_{\Omega}] = 7 P(\Omega) = 7 \cdot 1 = 7. \]

Solution (Exercise 14.7). By equation (13.12),
\[ P(T \ge k) = P(T > k - 1) = q^{k-1}. \]
By equation (14.18),
\[ E[T] = \sum_{k=1}^{\infty} q^{k-1} = \sum_{n=0}^{\infty} q^n = \frac{1}{1 - q} = \frac{1}{p}, \]
in agreement with equation (13.18).


Solution (Exercise 14.8). By assumption, the range of $X$ is contained in
the set of numbers $na$, $n = 0, 1, \dots$. Thus for $na \le t < (n+1)a$, $\{X > t\} =
\{X > na\}$.

Hence for $na \le t < (n+1)a$, $P(X > t) = P(X > na)$. Thus
\[ \int_{na}^{(n+1)a} P(X > t)\, dt = a P(X > na). \tag{14.20} \]
By equation (14.19),
\[ E[X] = \int_0^{\infty} P(X > t)\, dt = \sum_{n=0}^{\infty} \int_{na}^{(n+1)a} P(X > t)\, dt = \sum_{n=0}^{\infty} a P(X > na). \]
Let $k = n + 1$ in the summation. Then $k$ runs from 1 to $\infty$, and $\{X > na\} =
\{X \ge ka\}$. This gives equation (14.18).

Chapter 15


Exponential waiting times 


An exponential waiting time is the continuous-time analog of the coin-tossing
waiting time that was introduced in Section 13.2. The range of the exponential
waiting time is not a countable set. On the contrary, it is the whole
interval $[0, \infty)$.


15.1 The exponential distribution 


Definition 15.1 (The exponential distribution). For each $\lambda > 0$, let $h_\lambda$
be the function on $\mathbb{R}$ defined by
\[
h_\lambda(t) =
\begin{cases}
\lambda e^{-\lambda t} & \text{if } t \ge 0, \\
0 & \text{otherwise.}
\end{cases}
\tag{15.1}
\]
It was shown in an earlier exercise that $\int_{-\infty}^{\infty} h_\lambda = 1$, so that $h_\lambda$ is a probability
density on $\mathbb{R}$.
The function $h_\lambda$ is referred to as the exponential density with parameter
$\lambda$. The distribution with probability density $h_\lambda$ is called the exponential
distribution with parameter $\lambda$.
Any random variable having this probability distribution will be referred
to as an exponential waiting time.
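
As a rough numerical check that $h_\lambda$ integrates to 1, the sketch below applies a midpoint Riemann sum on a finite interval. The integrator, the cutoff at $t = 20$, the step count, and the value $\lambda = 2$ are illustrative assumptions; the part of the integral beyond the cutoff is $e^{-40}$, which is negligible.

```python
import math


def h(t: float, lam: float) -> float:
    """Exponential density with parameter lam, as in equation (15.1)."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0


def midpoint_integral(f, lo: float, hi: float, steps: int = 100_000) -> float:
    """Midpoint Riemann sum of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


lam = 2.0  # illustrative parameter value
# The density vanishes for t < 0, so it is enough to integrate over [0, 20].
total = midpoint_integral(lambda t: h(t, lam), 0.0, 20.0)
print(total)
```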


In order to establish properties of exponential waiting times, we are going 
to have to calculate some expected values. The next two sections say how to 
do that. 
15.2 Facts about general expectations


In this section we'll outline some facts that are true for all expectations. 

We will not state a rigorous definition of expectation for general random 
variables, but such a definition can be given fairly easily. The general defini- 
tion is consistent with the definitions given previously for the cases in which 
X has a finite range or a countable range. 


Theorem 15.2 (Properties of expectations of general random variables).
(i) For every bounded random variable $X$, $E[X]$ exists. (Bounded
functions are defined in Definition 10.18.)

If $X$ is an unbounded random variable then $E[X]$ exists if $X$ is not too
large.

(ii) The value of $E[X]$ is determined by the distribution of $X$, i.e. random
variables with the same distribution have the same expected value.
(And since random variables with the same distribution must have
the same expectation, we sometimes speak of “the expectation of the
distribution”, rather than the expectation of the random variable.)

(iii) The four key properties of expectation stated in an earlier theorem
hold for the general definition of expected value.


In the case of a finite-range random variable, the expected value has a
frequency interpretation, as stated in an earlier Probability Fact. What can
we say more generally?


Remark 15.3 (Bounded random variables and experiments). Suppose
that a bounded random variable $X$ is intended to model a measured
value in an experiment, and $X$ does not have finite range. (Perhaps the
experiment involves measuring a random distance along a road, or the weight
of a random lump of butter, so it would be unnatural to use a finite-range
random variable.) In this case $E[X]$ has the same physical interpretation
as in the finite-range case. It is the long-run average of the measured value
of $X$.
What about the interpretation of an unbounded random variable? If X
is unbounded, even if E[X] exists we won’t try to state a direct physical 
interpretation for E[X]. In our work we will think of unbounded random 
variables simply as mathematical tools, which help us to understand bounded 
random variables. 


Remark 15.4 (Mean zero nonnegative random variables). Let $X$ be
a nonnegative random variable such that $E[X] = 0$. Does that imply that
$X(\omega) = 0$ for all $\omega$? Our picture of random variables certainly suggests that
something like this must be true. But it’s not exactly true. For example,
suppose that $\Omega$ is the unit interval with the uniform distribution, and let
$X(\omega) = 0$ for all $\omega$ except $\omega = 1/3$. Then certainly $E[X] = 0$, no matter
what $X(1/3)$ is. For example, we could have $X(1/3)$ equal to a trillion, and
it would make no difference.

What is true is that $P(X = 0) = 1$. We say in words “$X$ is zero with
probability one”.

Do we need to prove this statement? It seems very plausible. A proof is
given in Appendix F.


15.3 Expectations when there is a density on the sample space


We have noted two situations where it can be natural to use probability
densities. In an earlier section the sample space is an interval of the real
line, and probabilities are defined by equation (3.7). When the sample space
is a region in the plane, we mentioned that probability densities can be used
in a similar way, though we didn’t bother to give details.

Whenever a distribution has a probability density, the probability of an
event is given by integrating the density over the event.

Let’s recall the concept of integration over a set (introduced in
Section 3.4).

Suppose we have a sample space $\Omega$ on which integration is defined. $\Omega$
doesn’t have to be an interval of the real line, as long as we know how to
integrate. So $\Omega$ might be a region in the plane, or more generally a region in
$\mathbb{R}^n$, or even something else.
Let $\int f$ denote whatever form of integral we are using on the sample space
$\Omega$. If $\Omega$ is the real line, calculus books often write $\int f$ as
$\int_{-\infty}^{\infty} f(x)\,dx$, and if $\Omega$ is the plane, $\int f$ is often written in
calculus books as

\[ \iint f(x,y)\,dy\,dx. \]

The integral of a function $f$ over a set $A$ is denoted by $\int_A f$. If $A$ is an
interval $[a,b]$ of the real line, calculus books often write $\int_A f$ as
$\int_a^b f(x)\,dx$, and if $A$ is a subset of the plane, $\int_A f$ is often written in
calculus books as

\[ \iint_A f(x,y)\,dy\,dx. \]

For thinking about the logic of a problem, it is likely clearer to just write
the integral over a set $A$ as $\int_A f$, as long as the reader understands what
sort of integral you are working with.

Incidentally, a precise general definition of integrating a function $f$ over
a set $A$, that works on any space, is given in Definition 3.5.


Probability densities were introduced earlier in the setting of
the real line. A general definition of a probability density is the following.


Definition 15.5 (Probability densities for distributions (general
formulation)). In general, to say that $f$ is a probability density simply means
that $f$ is a nonnegative function whose integral over the whole space is equal
to one, i.e. $\int f = 1$. Here it is assumed that integration on the space has
been defined.

To say that a distribution has a probability density $f$ means that the
probability of an event is given by integrating $f$ over the event, i.e. for any
event $A$,

\[ P(A) = \int_A f. \tag{15.2} \]


Remark 15.6 (Comparison with the previous density definition).
Remark 3.6 shows that the definition above is consistent with the original
definition given in Definition 3.4 for the real line case.
In Definition 3.4, equation (15.2) is only required to hold for intervals, but
Remark 3.6 states that if equation (15.2) holds for events which are intervals
of the real line, then it actually holds for all events.


Examples of densities on subsets of the real line were given in Section 3.4
and elsewhere. Appendix E has some examples of using probability densities
on subsets of the plane.

As noted in an earlier example, if probabilities are given by a probability
density $f$ on the sample space, we can also use $f$ to find expected values.
For any random variable $X$ on the sample space,

\[ E[X] = \int X f, \tag{15.3} \]

provided of course that the integral of $Xf$ exists.

Here $\int Xf$ means the integral of $Xf$ over the sample space $\Omega$. If $\Omega$ is
an interval of the real line, say $\Omega = [s,t]$, then this equation is the same
statement as equation (10.33):

\[ E[X] = \int_s^t X(u) f(u)\,du. \]

But the equation holds for any sample space, for example if the sample
space is a region in the plane or in $\mathbb{R}^n$. The only difference in those other
cases is that the integral in the equation may require more work.

We are ready to start computing expectations when probabilities are given
by a density on the sample space. But what if we are not told the definition
of a random variable $X$ on any sample space? Instead, suppose that someone
simply gives us the probability distribution of $X$. Can we still find $E[X]$?

Part (ii) of the theorem in Section 15.2 says that the distribution of a
random variable determines the expected value. So, yes, in principle it must
be true that we can find $E[X]$, if someone tells us the distribution of $X$.

If it happens that the probability distribution of $X$ is given by a density
function $h$ on the real line, then there is a neat formula:

\[ E[X] = \int_{-\infty}^{\infty} t\,h(t)\,dt, \tag{15.4} \]
provided of course that the integral of $t\,h(t)$ exists.
More generally, for any function $\varphi$ on the real line,

\[ E[\varphi(X)] = \int_{-\infty}^{\infty} \varphi(t)\,h(t)\,dt, \tag{15.5} \]

provided that the integral of $\varphi(t)h(t)$ exists.

With $\varphi(t) = t$, equation (15.5) is the same as equation (15.4).

Here’s a summary of how to derive equation (15.5) from equation (15.3).

Since you don’t know the sample space that $X$ is defined on, go ahead and
define your own sample space $\Omega$. Let it be the real line, with probabilities
given by the density function $h$. Then define a new random variable $Z$ on
the real line, given by $Z(t) = t$.

One can easily show from the definitions that the probability distribution
of $\varphi(Z)$ is exactly the same as the probability distribution of $\varphi(X)$ (see
Appendix C). And since the distribution of a random variable determines
the expected value, we then know that $E[\varphi(Z)]$ is equal to the expected
value of $\varphi(X)$ on its sample space.

And to find $E[\varphi(Z)]$ we have equation (15.3), with $X$ in that equation
replaced by $\varphi(Z)$ on both sides, and $f$ replaced by $h$. On the right side,
the integral is over the real line, and since $\varphi(Z(t))$ is $\varphi(t)$, we have
equation (15.5).

That’s almost the whole argument, but Appendix D has more details.
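
To see the formula for the expected value of a function of $X$ in action, here is a minimal Python sketch that approximates the integral of (function) times (density) with the midpoint rule. Everything named here (`expectation`, `uniform_density`, the step count `n`) is our own illustrative choice, not notation from the text, and the sketch assumes the density is zero outside the interval $[a,b]$ that we integrate over.

```python
def expectation(phi, h, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of phi(t) * h(t) over [a, b].

    Assumes (for this sketch) that the density h vanishes outside [a, b].
    """
    dt = (b - a) / n
    total = 0.0
    for i in range(n):
        t = a + (i + 0.5) * dt  # midpoint of the i-th subinterval
        total += phi(t) * h(t)
    return total * dt


def uniform_density(t):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= t <= 1.0 else 0.0


if __name__ == "__main__":
    # First and second moments of the uniform distribution on [0, 1].
    print(expectation(lambda t: t, uniform_density, 0.0, 1.0))
    print(expectation(lambda t: t * t, uniform_density, 0.0, 1.0))
```

For the uniform density on $[0,1]$ the exact values are $\int_0^1 t\,dt = 1/2$ and $\int_0^1 t^2\,dt = 1/3$, so the printed numbers should be close to those.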


Exercise 15.1 (The Cauchy distribution). Let $X$ be a random variable
whose probability distribution has a density $h$ given by

\[ h(x) = \frac{c}{1 + x^2} \tag{15.6} \]

for all $x \in \mathbb{R}$. Then $X$ is said to have a standard Cauchy distribution.

1. Find $c$.
2. Does $E[X]$ exist?

Exercise 15.2. Let $X$ be a random variable whose distribution is uniform
on $[0,5]$. (This is a short way of saying that the distribution on the real line is
uniform on $[0,5]$ and zero everywhere else. We talked about the probability
density for the distribution of such a random variable in Example 9.12.)
Find $E[\sin X]$.


Now we are ready to get back to exponential waiting times. 


15.4 Properties of the exponential distribution


Exercise 15.3 (Mean of the exponential distribution). Let $T$ be a
random variable whose distribution is exponential with parameter $\lambda$. Show
that

\[ E[T] = \int_0^\infty t\,\lambda e^{-\lambda t}\,dt = \frac{1}{\lambda}. \tag{15.7} \]


As noted in an earlier section, for any random variable $X$ the function
$t \mapsto P(X > t)$ is sometimes called the tail of the distribution (on the right).

One of the things we learn from the tail is the rate at which the probability
$P(X > t)$ approaches zero as $t \to \infty$. It is easy to calculate the tail
function for the exponential distribution.

Let $X$ be a random variable having exponential distribution with
parameter $\lambda$. The tail function is $P(X > t)$. For $t \ge 0$, we have

\[ P(X > t) = \int_{\{X > t\}} h_\lambda = \int_t^\infty \lambda e^{-\lambda u}\,du = -e^{-\lambda u}\,\Big|_t^\infty = e^{-\lambda t}. \tag{15.8} \]

Of course, since $X$ has an exponential distribution, $P(X > 0) = 1$, so
$P(X > t) = 1$ for all $t < 0$.

Clearly the tail of an exponential distribution approaches zero rapidly as
$t \to \infty$.
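
The tail formula can also be checked by simulation. The following is a minimal sketch, assuming we are happy to use Python's standard generator `random.expovariate` as our source of exponential waiting times; the helper names `simulate_exponential` and `empirical_tail`, the sample size and the seed are our own choices. It compares the fraction of simulated values exceeding $t$ with $e^{-\lambda t}$.

```python
import math
import random


def simulate_exponential(lam, n, seed=0):
    """Draw n samples from the exponential distribution with parameter lam."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return [rng.expovariate(lam) for _ in range(n)]


def empirical_tail(samples, t):
    """Fraction of the samples that are strictly greater than t."""
    return sum(1 for x in samples if x > t) / len(samples)


if __name__ == "__main__":
    lam = 2.0
    samples = simulate_exponential(lam, 200_000)
    for t in (-1.0, 0.0, 0.5, 1.0, 2.0):
        exact = math.exp(-lam * t) if t >= 0 else 1.0
        # The empirical and exact tail values should be close.
        print(t, empirical_tail(samples, t), exact)
```

By the frequency interpretation, the two columns should agree up to random fluctuations, which shrink as the number of samples grows.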


Exercise 15.4. In Theorem 14.14 we stated a useful general formula for
expectation called “The Tail-Integral formula for expectation”, although no
proof was given. Test the tail integral formula by calculating the mean of
the exponential distribution again.


Exercise 15.5 (Expectation of square of random variable with
exponential distribution). Suppose that $X$ has the exponential distribution
with parameter $\lambda$, as defined earlier.

Calculate $E[X^2]$.


Recall from an earlier equation and Definition 13.3 that, when $T$ is the
waiting time for first success in coin-tossing with $p > 0$, the distribution of
$T$ is the geometric distribution, and $P(T > k) = q^k$, where $q = 1 - p$ and $k$
is a nonnegative integer.

Let $r = -\log q$, where $\log q$ denotes the logarithm of $q$ with base $e$. Since
$q < 1$, $r$ is a positive number. We have $P(T > k) = e^{-rk}$, so the tail of
the geometric distribution seems quite similar to the tail of the exponential
distribution.
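
The match between the two tails at integer times is exact, not just approximate. Here is a small Python sketch that makes the comparison concrete; the function names are ours, and `exponential_tail` simply encodes the tail function $e^{-rt}$ for $t \ge 0$ (and 1 for $t < 0$) from the calculation above.

```python
import math


def geometric_tail(q, k):
    """P(T > k) for the coin-tossing waiting time, with failure probability q."""
    return q ** k


def exponential_tail(rate, t):
    """P(X > t) for an exponential waiting time with the given rate."""
    return math.exp(-rate * t) if t >= 0 else 1.0


if __name__ == "__main__":
    q = 0.7
    r = -math.log(q)  # the rate that matches the geometric tail
    for k in range(6):
        # The two values agree up to floating-point rounding.
        print(k, geometric_tail(q, k), exponential_tail(r, k))
```

The difference between the two models is only in what happens between the integer times: the geometric tail is constant there, while the exponential tail keeps decreasing.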

The similarity of these two distributions suggests that an exponential 
random variable T is the continuous-time analogue of the waiting time for 
first success in coin-tossing. And in fact T is used as a model for a “lifetime” 
or a waiting time in many situations. 

For example, if you are measuring the rate of decay of some radioactive 
material, and waiting for a click on a Geiger counter, the random time you 
have to wait has an exponential distribution. The time spent waiting for a 
telephone call at a sales desk, or a data request at a computer server, will 
often have a distribution which is approximately exponential. 

We speak of $T$ as a continuous-time random variable, to distinguish it
from a random variable such as the time of first success in coin-tossing. Coin
tosses occur at a countable sequence of times. The times for the coin tosses
are often said to form a set of discrete times, meaning that there are gaps
between the times (see Remark 15.9). In contrast to the coin-tossing case,
here the range of $T$ is a continuous interval.

Discussing any kind of waiting times, one presumably wants to know when 
the waiting begins. When waiting for a click on a Geiger counter, the wait 
starts when some observer starts to record data. But radioactivity happens
continually, so the starting time is not connected at all to the physical process. 
Thus it seems as if the choice of starting time could affect the observed 
statistical distribution of the waiting time. 

However, for the exponential distribution, just as with coin-tossing
(Exercise 13.3), the observed distribution does not depend on the choice of
starting time.


Lemma 15.7 (The “memoryless” property). Let $T$ be a waiting time
having an exponential distribution with parameter $\lambda$. For some given time
$s > 0$, let $A = \{T > s\}$. Then for any $t > 0$,

\[ P(T - s > t \mid A) = P(T > t), \tag{15.9} \]

i.e.

\[ P(T > s + t \mid A) = P(T > t). \tag{15.10} \]


In Lemma 15.7, one can think of $T$ as the time that some observer has to
wait for a particular event to occur. If a new observer arrives at time $s > 0$,
there are two possibilities.


• If the event has already occurred, the new observer has nothing to wait
for, and does not record the result.


• If the event has not yet occurred (i.e. if we are in the situation described
by $A$), then the new observer is waiting along with the original observer.
The new observer will wait time $T - s$ until the event takes place.


The left side of equation (15.9) describes the statistical properties of
the time that the new observer records, given that the event has not
already occurred when the new observer arrives.
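
The two-observer story can be acted out in a simulation. The sketch below is only an illustration under our own assumptions: waiting times are generated with Python's `random.expovariate`, and the names `conditional_tail` and `unconditional_tail`, the seed, and the values of $\lambda$, $s$ and $t$ are arbitrary choices. It keeps only the runs in which the event has not occurred by time $s$, and looks at the extra wait $T - s$ in those runs.

```python
import math
import random


def conditional_tail(samples, s, t):
    """Among samples with T > s, the fraction that also satisfy T - s > t."""
    survivors = [x for x in samples if x > s]
    return sum(1 for x in survivors if x - s > t) / len(survivors)


def unconditional_tail(samples, t):
    """Fraction of all samples with T > t."""
    return sum(1 for x in samples if x > t) / len(samples)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so that runs are repeatable
    lam, s, t = 1.0, 1.0, 0.5
    samples = [rng.expovariate(lam) for _ in range(400_000)]
    # New observer's tail, original observer's tail, and the exact value.
    print(conditional_tail(samples, s, t),
          unconditional_tail(samples, t),
          math.exp(-lam * t))
```

If the memoryless property holds, the first two printed numbers should both be close to the third, up to sampling noise.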


Exercise 15.6. Prove Lemma 15.7.
Remark 15.8 (Memoryless really means memoryless). In the setting
of Lemma 15.7, think of $T$ as the waiting time for some physical event $D$.

Let $[a,b]$ be a subinterval of $(s,\infty)$. Let $T^* = T - s$. The lemma says
that

\[ P(T^* > t \mid \{T > s\}) = P(T > t). \tag{15.11} \]

Let $P^*$ denote probabilities conditioned on $\{T > s\}$. In this notation, the
equation above says that $P^*(T^* > t) = P(T > t)$ for all $t > 0$. Since these
random variables are nonnegative, $P^*(T^* > t) = 1 = P(T > t)$ for all $t < 0$.

So $P^*(T^* > t) = P(T > t)$ for all $t$.

Based on that equality, one can show that the whole distribution of $T^*$
is the same as the whole distribution of $T$. So all statistical properties must
be the same for both random variables.


The preceding remark explains why we do not need to specify the starting
time for the exponential distribution.


Let’s think more about the parameter $\lambda$ in the exponential distribution
for $T$. We know now that $P(T > t) = e^{-\lambda t}$ and that $E[T] = 1/\lambda$. Both those
equations tell us that as $\lambda$ increases the waiting time $T$ tends to become
smaller. We can make that statement more precise by considering the tail
function $s(t) = P(T > t)$. Thinking about $T$ as the lifetime of a randomly
selected object, we might call $s$ the “survival function”, since it gives the
probability that the randomly selected object is still alive at time $t$.
Thinking of the randomly selected object as part of a large population, the
frequency interpretation says that $s(t)$ represents the fraction of the
population that is still alive at time $t$. Notice that $s(t) = e^{-\lambda t}$ satisfies a
simple differential equation on $(0, \infty)$:

\[ s'(t) = -\lambda s(t). \tag{15.12} \]

Suppose that the initial size of the population is $N$, where $N$ is some large
number. We would expect that at time $t$ the surviving population would have
size approximately equal to $N s(t)$.

At time $t + \Delta t$, the size of the population is approximately $N s(t + \Delta t)$.
The number of objects that have died during the time interval $[t, t + \Delta t]$ is
approximately $N s(t) - N s(t + \Delta t)$. Using equation (15.12),

\[ s(t) - s(t + \Delta t) \approx -s'(t)\,\Delta t = \lambda s(t)\,\Delta t. \]
Thus the number of objects that have died during the time interval
$[t, t + \Delta t]$ is approximately $N \lambda s(t)\,\Delta t$.

The number of living objects at time $t$ is approximately $N s(t)$. Thus the
fraction of the current population which dies during $[t, t + \Delta t]$ is
approximately

\[ \frac{N \lambda s(t)\,\Delta t}{N s(t)} = \lambda\,\Delta t. \]

Dividing by $\Delta t$ shows that the average death rate per object per unit time
is $\lambda$.
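
Here is a small numerical illustration of this death-rate calculation, written as a minimal Python sketch. The function names `survival` and `fraction_dying`, and the particular values of the rate, the time and the interval length, are our own choices. It computes the exact fraction of the surviving population that dies in a short interval and divides by the length of the interval.

```python
import math


def survival(lam, t):
    """Survival function s(t) = exp(-lam * t) for t >= 0."""
    return math.exp(-lam * t)


def fraction_dying(lam, t, dt):
    """Fraction of those alive at time t that die during [t, t + dt]."""
    return (survival(lam, t) - survival(lam, t + dt)) / survival(lam, t)


if __name__ == "__main__":
    lam, dt = 3.0, 1e-4
    for t in (0.0, 1.0, 5.0):
        # Death rate per object per unit time; should be close to lam at every t.
        print(t, fraction_dying(lam, t, dt) / dt)
```

The printed rate does not depend on $t$, which is one more way of seeing that the exponential distribution has no memory.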

With this interpretation in mind, one might call $\lambda$ the “death rate”, or
more briefly the rate for the exponential distribution which has parameter
$\lambda$.

The equation $s'(t) = -\lambda s(t)$ is a differential equation for the survival
function $s$. We assume $s(0) = 1$, since at time 0 there has not been time for
any deaths. Hence we are interested in a solution $s$ of this equation which
satisfies the initial condition $s(0) = 1$.

(It is easy to show a uniqueness fact: the survival function $s(t) = e^{-\lambda t}$
is the only solution of this equation satisfying the initial condition
$s(0) = 1$. The argument is often given in calculus courses.)


Remark 15.9 (Discrete versus continuous). One of the meanings of the 
word “discrete” in ordinary language is “individually separate and distinct”. 
The points in any finite set certainly fit this description. When the sample 
space of a probability model is finite, or when the points of the sample space 
can be listed in an infinite sequence, we sometimes refer to the model as a 
discrete probability model. 

A probability distribution such that all points have probability zero is 
often referred to as a continuous probability distribution (“no lumps”). A 
probability distribution which is given by a density fits that description. 

There certainly exist probability distributions which are neither discrete 
nor continuous. For example, a model could have a discrete part and a 
continuous part. However, the distributions in our examples are usually one 
or the other. 

Terminology: recall that we have also described an interval as a contin- 
uous set, meaning a set with no gaps. This is a different use of the word 
“continuous”. Of course a distribution which is given by a density on an 
interval is a continuous distribution on a continuous interval, so the word 
“continuous” applies to this situation in both senses. Our model for
throwing darts (Section 3.6) is also continuous in both senses.
15.5 Solutions for Chapter 15


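Solution (Exercise 15.1). (i) Since $h$ is a probability density we must
have $\int h = 1$.

\[ \int_{-\infty}^{\infty} h = \int_{-\infty}^{\infty} \frac{c}{1 + x^2}\,dx = \lim_{b \to \infty} \int_{-b}^{b} \frac{c}{1 + x^2}\,dx = \lim_{b \to \infty} c\,\arctan x\,\Big|_{-b}^{b} \]
\[ = c \lim_{b \to \infty} \bigl(\arctan b - \arctan(-b)\bigr) = 2c \lim_{b \to \infty} \arctan b = 2c\,\frac{\pi}{2} = c\pi. \]

Hence $c = 1/\pi$.


(ii) By equation (15.4),

\[ E[X] = \int_{-\infty}^{\infty} t\,h(t)\,dt. \]

However, we have to be careful in evaluating this integral, because the
integrand has both a positive and a negative part. The integral of the positive
part is

\[ \int_0^\infty t\,h(t)\,dt = \frac{1}{\pi} \lim_{b \to \infty} \int_0^b \frac{t}{1 + t^2}\,dt. \]

Using the fact that $t^2 \ge 1$ on $[1, \infty)$, so that $1 + t^2 \le 2t^2$ there, we have

\[ \int_0^\infty t\,h(t)\,dt \ge \frac{1}{\pi} \lim_{b \to \infty} \int_1^b \frac{t}{2t^2}\,dt = \frac{1}{2\pi} \lim_{b \to \infty} \int_1^b \frac{1}{t}\,dt = \frac{1}{2\pi} \lim_{b \to \infty} (\log b - \log 1) = \infty. \]

Thus $E[X]$ does not exist.
(In this calculation, we use log to denote logarithms to the base $e$.)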
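
To watch this divergence happen numerically, here is a minimal Python sketch (our own illustration, not part of the original solution). It approximates the truncated integral $\int_0^b t\,h(t)\,dt$ with the midpoint rule, using $h(t) = 1/(\pi(1+t^2))$, and compares it with the antiderivative value $\log(1+b^2)/(2\pi)$, which follows from the substitution $u = 1 + t^2$. The function names and the step count `n` are arbitrary choices.

```python
import math


def cauchy_positive_part(b, n=200_000):
    """Midpoint-rule approximation of the integral of t*h(t) over [0, b],
    where h(t) = 1 / (pi * (1 + t**2)) is the standard Cauchy density."""
    dt = b / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt  # midpoint of the i-th subinterval
        total += t / (math.pi * (1.0 + t * t))
    return total * dt


def cauchy_positive_part_exact(b):
    """Closed form of the same integral: log(1 + b**2) / (2*pi)."""
    return math.log(1.0 + b * b) / (2.0 * math.pi)


if __name__ == "__main__":
    for b in (10.0, 100.0, 1000.0):
        print(b, cauchy_positive_part(b), cauchy_positive_part_exact(b))
```

The values keep growing (slowly, like $\log b$) as $b$ increases, so the integral of the positive part has no finite limit.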


Solution (Exercise 15.2). Let $f$ be the function on $[0,5]$ defined by
$f(x) = 1/5$ for all $x \in [0,5]$. Then $f$ is a probability density for the uniform
distribution on $[0,5]$.
As in Example 9.12, we obtain a probability density $h$ for the distribution
of $X$ by extending $f$. We extend it to be equal to zero on the complement
of $[0,5]$, so $h$ is given as in equation (9.17):

\[ h(x) = \begin{cases} 1/5 & \text{if } x \in [0,5], \\ 0 & \text{otherwise.} \end{cases} \]

By equation (15.5),

\[ E[\sin X] = \int_{-\infty}^{\infty} \sin(x)\,h(x)\,dx = \int_0^5 \sin(x)\,\frac{1}{5}\,dx = -\frac{1}{5} \cos x\,\Big|_0^5 = \frac{1}{5}(1 - \cos 5). \]
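Solution (Exercise 15.3). By equation (15.4),

\[ E[T] = \int_{-\infty}^{\infty} t\,h_\lambda(t)\,dt = \int_0^\infty t\,\lambda e^{-\lambda t}\,dt = \lim_{b \to \infty} \int_0^b t\,\lambda e^{-\lambda t}\,dt. \]

Using integration by parts,

\[ E[T] = \lim_{b \to \infty} \left( -t e^{-\lambda t}\,\Big|_0^b + \int_0^b e^{-\lambda t}\,dt \right). \]

We know that $b e^{-\lambda b} \to 0$ as $b \to \infty$, a consequence of L’Hôpital’s Rule, and
of course $e^{-\lambda b} \to 0$ as $b \to \infty$. Hence

\[ E[T] = \lim_{b \to \infty} \left( -\frac{1}{\lambda} e^{-\lambda t}\,\Big|_0^b \right) = \frac{1}{\lambda}. \]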


Solution (Exercise 15.4). Let $X$ be a random variable having exponential
distribution with parameter $\lambda$.
By Theorem 14.14 and equation (15.8),
\[ E[X] = \int_0^\infty P(X > t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}. \]
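Solution (Exercise 15.5). By equation (15.5),

\[ E[X^2] = \int_0^\infty x^2 \lambda e^{-\lambda x}\,dx = \lim_{b \to \infty} \int_0^b x^2 \lambda e^{-\lambda x}\,dx. \]

We will use integration by parts to calculate the integral, and then use the
fact that $b^2 e^{-\lambda b} \to 0$ and $b e^{-\lambda b} \to 0$ as $b \to \infty$. Thus

\[ E[X^2] = \lim_{b \to \infty} \left( -x^2 e^{-\lambda x}\,\Big|_0^b + \int_0^b 2x e^{-\lambda x}\,dx \right) \]
\[ = \lim_{b \to \infty} \left( -\frac{2x}{\lambda} e^{-\lambda x}\,\Big|_0^b + \frac{2}{\lambda} \int_0^b e^{-\lambda x}\,dx \right) = \frac{2}{\lambda^2}. \]


Solution (Exercise 15.6).

\[ P(T > s + t \mid A) = \frac{P(\{T > s + t\} \cap A)}{P(A)} = \frac{P(T > s + t)}{P(T > s)}. \]

By the tail formula in equation (15.8),

\[ P(T > s + t \mid A) = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T > t). \]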
Moments and inequalities


The mean of a distribution can be thought of as an average value for the 
random variable which has that distribution. It can also be thought of as a 
“central point” of the distribution. In this chapter we introduce the concept 
of the variance of a distribution, which can be thought of as a measure of the 
“width” of the distribution. 

Readers may not wish to work all the exercises in this chapter. The goal 
should be to develop a feeling for how the concept of variance is used. 


16.1 Moments 


If a random variable has a large range, then its distribution can be compli- 
cated, even if the range is finite. We need to identify simple properties that 
help us to understand the behavior of the random variable. 

The expected value of a random variable X is usually the most important 
such property, but an expected value is just one number. We can learn more 
by calculating moments of the random variable. 


Definition 16.1 (Moments of a random variable). For $n = 0, 1, 2, \ldots$,
the $n$-th moment of the random variable $X$ is $E[X^n]$, provided that it exists.
The $n$-th absolute moment is defined to be $E[\,|X|^n\,]$.


The first moment of $X$ is the expected value $E[X]$. When it exists, and
when $X$ represents some property of an experiment, we know $E[X]$ is likely
to be close to the average measured value in repeated experiments. The first
absolute moment gives the same sort of information about the absolute value
of $X$. Thus the first absolute moment gives the average size of the random
variable.

We will soon study the second moment of a random variable carefully. In
this section we’ll make a few general remarks about moments.

By definition, the expected value of $X$ exists if and only if the expected
value of $|X|$ exists. Mathematical random variables can be so large that their
expected values do not exist. A random variable with a Cauchy distribution
(Exercise 15.1) is a typical example. And even if the expected value exists,
higher moments may not.

Here are a few more examples. 


Exercise 16.1. (i) Let $\Omega = (0,1)$, with uniform distribution. Let
$X(t) = 1/t$. Show that $E[X]$ does not exist.


(ii) Let $\Omega = (0,1)$, with uniform distribution. Let $X(t) = 1/\sqrt{t}$. Show that
$E[X]$ exists but $E[X^2]$ does not exist.


(iii) Let $\Omega = \{1, 2, \ldots\}$, and let $P$ be a distribution on $\Omega$ such that
$P(\{n\}) = c/n^5$ for some constant $c$. Let $X(j) = j$. Show that $E[X]$, $E[X^2]$
and $E[X^3]$ exist but $E[X^4]$ does not exist.


If you are a trusting soul, you will likely assume your random variable 
has all the moments that you need. And most of the time in this book that 
works. For those who worry, the rest of this section has some information 
that you can use when you are trying to confirm that a moment exists. 


Lemma 16.2 (Existence of lower moments). If $E[X^n]$ exists then $E[X^k]$
exists for all $k \le n$.


Proof. Here’s a handy fact: for any nonnegative integer $k \le n$,
\[ |x|^k \le 1 + |x|^n, \tag{16.1} \]
for all $x$.
To check equation (16.1), consider two cases: $|x| \le 1$ and $|x| > 1$. In the
first case, $|x|^k \le 1$. In the second case, $|x|^k \le |x|^n$.

Since $|X|^k \le 1 + |X|^n$, the statement of the lemma follows by the
comparison principle for expected values (part (iv) of Theorem 14.9).


Lemma 16.3 (Moments of a sum). Let $n$ be a positive integer. If $E[|X|^n]$
exists and $E[|Y|^n]$ exists, then $E[|X + Y|^n]$ exists.


Proof. Claim: for any numbers $x, y$, and any positive integer $n$,
\[ |x + y|^n \le 2^n |x|^n + 2^n |y|^n. \tag{16.2} \]

To justify the claim, remember the triangle inequality (Appendix B): for
any numbers $x, y$,
\[ |x + y| \le |x| + |y|. \tag{16.3} \]

So equation (16.2) certainly holds for $n = 1$. In fact, equation (16.2) is a
cruder inequality than equation (16.3), isn’t it? But hey, we ain’t bein’ paid
to be fancy here. We just need to get an upper bound for $|x + y|^n$.

Anyway, for general $n$ we have

\[ |x + y|^n \le (|x| + |y|)^n. \]
Now consider the case that $|x| \le |y|$. In this case, $|x| + |y| \le 2|y|$, and so
\[ |x + y|^n \le (2|y|)^n = 2^n |y|^n, \]

so equation (16.2) holds.

The other possible case is that $|y| \le |x|$, and of course a similar argument
works there too.
This proves the claim.

By equation (16.2),
\[ |X + Y|^n \le 2^n |X|^n + 2^n |Y|^n. \]
The expected value of $2^n |X|^n + 2^n |Y|^n$ exists, by additivity
(Theorem 14.9).

And then the comparison principle (part (iv) of Theorem 14.9) says that
the expected value of $|X + Y|^n$ exists.
The preceding lemma dealt with existence of moments when we add random
variables. Lemma 16.4 will help us deal with expected values of products.

Before going on to consider that lemma, please be sure to work through
the next exercise. It provides us with a pleasing inequality that is often useful.


Exercise 16.2 (Inequality for a product). Let $x, y$ be real numbers.
Show that
\[ 2xy \le x^2 + y^2. \tag{16.4} \]
You can begin by noting that $(x - y)^2 \ge 0$ is always true.
As long as you are showing that this inequality holds, you might as well
also show:
\[ (x + y)^2 \le 2x^2 + 2y^2. \tag{16.5} \]


Since the inequality in equation (16.4) holds for all $x$ and $y$, we can of
course replace $x$ by $|X|$ and $y$ by $|Y|$ in this inequality, take expected
values, and obtain:

\[ 2E[|XY|] \le E[X^2] + E[Y^2]. \tag{16.6} \]

As a consequence of equation (16.6) and the comparison principle, if
$E[X^2]$ and $E[Y^2]$ exist, then $E[XY]$ exists, and then we have

\[ 2E[XY] \le E[X^2] + E[Y^2]. \tag{16.7} \]
This proves:


Lemma 16.4 (Existence of the expectation of a product). If $E[X^2]$
exists and $E[Y^2]$ exists, then $E[XY]$ exists.


Exercise 16.3 (First moment exists if second does). Show that if
$E[X^2]$ exists then $E[X]$ exists. Do this in two ways: using Lemma 16.2,
and using Lemma 16.4.
16.2 Variance


Now let’s think about the significance of the second moment of a random 
variable. We can understand the second moment of X better by thinking 
about the centered version of X, which is defined as follows. 


Definition 16.5 (Centered random variables). A random variable $X$
will be said to be centered if $E[X] = 0$. In this case $X$ is also said to be a
mean zero random variable.

Let $X$ be a random variable such that $E[X]$ exists. The centered version
of $X$ is the random variable $X - E[X]$. The value of $X - E[X]$ is also called
the deviation of $X$ from its mean.

In calculations we often write $E[X]$ as $\mu$, so that the deviation of $X$ from
its mean is written as $X - \mu$.


Definition 16.6 (Variance). Let $X$ be a random variable whose
expectation exists.

The centered second moment, $E[(X - E[X])^2]$, is called the variance
of $X$, and is denoted by $\mathrm{Var}(X)$, when this expected value exists. It is often
referred to as the mean square deviation of $X$.

The square root of the variance of $X$ is called the standard deviation of
$X$, and is often written as $\sigma$ in calculations.


If one must describe the properties of a probability distribution using
only two numbers, the mean and the variance of a distribution are usually
the most informative. The variance tells us how “spread-out” the distribution
is.

We often write the expression for $\mathrm{Var}(X)$ more neatly using $\mu$ to denote
$E[X]$:

\[ \mathrm{Var}(X) = E[(X - \mu)^2]. \tag{16.8} \]


Remark 16.7 (Variance of a distribution). The distribution of $X$
determines $E[X^n]$ and $\mathrm{Var}(X)$, so we will at times speak of “the moments
of a distribution” and “the variance of a distribution”, or “the moments of a
density” and “the variance of a density”.
Thinking about existence of the variance, note that we only speak about
the variance of $X$ in situations where $E[X]$ exists.

Denote $E[X]$ by $\mu$. Since $X - \mu = X + (-\mu)$, the $n = 2$ case of Lemma 16.3
tells us that $E[(X - \mu)^2]$ exists if $E[X^2]$ exists.

Since $X = (X - \mu) + \mu$, the $n = 2$ case of the same lemma also says that if
$E[(X - \mu)^2]$ exists then $E[X^2]$ exists.

Thus whenever $E[X]$ exists, $\mathrm{Var}(X)$ exists if and only if $E[X^2]$ exists.


That’s all we have to say about existence of the variance. 


The next exercise is of major importance! 


Exercise 16.4 (“Mean square minus square mean”). By expanding
equation (16.8), show that

\[ \mathrm{Var}(X) = E[X^2] - (E[X])^2. \tag{16.9} \]

Notice that this equation shows immediately that $\mathrm{Var}(X) \le E[X^2]$.
Also, the definition of $\mathrm{Var}(X)$ shows that $\mathrm{Var}(X) \ge 0$, so equation (16.9)
tells us that

\[ (E[X])^2 \le E[X^2]. \tag{16.10} \]

Incidentally, we can replace $X$ by $|X|$ in equation (16.10), so we also have:
\[ (E[|X|])^2 \le E[X^2]. \tag{16.11} \]
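
Before (or after) proving the “mean square minus square mean” formula, it can be reassuring to test it on a small example. The following minimal Python sketch does this for a finite-range random variable, using exact fractions so that no rounding is involved. The representation of a distribution as a dictionary from values to probabilities, and all the function names, are our own assumptions for the illustration.

```python
from fractions import Fraction


def mean(dist):
    """E[X] for a finite distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())


def variance_by_definition(dist):
    """E[(X - mu)^2], the centered second moment."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


def variance_by_moments(dist):
    """E[X^2] - (E[X])^2, mean square minus square mean."""
    second_moment = sum(x * x * p for x, p in dist.items())
    return second_moment - mean(dist) ** 2


if __name__ == "__main__":
    # A fair six-sided die, as an example of a finite-range random variable.
    die = {k: Fraction(1, 6) for k in range(1, 7)}
    print(mean(die), variance_by_definition(die), variance_by_moments(die))
```

For the fair die both computations give the same exact fraction, and the second moment is visibly at least as large as the square of the mean.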


Exercise 16.5 (Variance of a constant). Let X be a constant random 
variable in some probability model, so that X = 7 everywhere. Find E[X] 
and Var (X). 
Remark 16.8 (When the variance is zero). Let $X$ be a random variable
with mean $\mu$ and variance zero. Then $E[(X - \mu)^2] = 0$.

Since $(X - \mu)^2$ is a nonnegative random variable with mean zero, the
remark on mean zero nonnegative random variables says that $(X - \mu)^2 = 0$
with probability one. Thus when the variance of a random variable is zero,
the random variable is constant with probability one, and is equal to its
mean with probability one.


Exercise 16.6 (Variance of a uniform distribution). Let $X$ be the
random variable on $[0,5]$ with $X(\omega) = \omega$. Using the uniform distribution on
$[0,5]$, calculate the mean and variance of $X$.

Then, generalize your work. Find the mean and variance of a random
variable $X$ whose distribution is uniform on an interval $[s,t]$.


Example 16.9 (Variance for a coin toss). Let $X$ represent the result of
tossing a coin. $X = 1$ means a head (success), and $X = 0$ means a tail. Assume
$P(X = 1) = p$.

Then $E[X] = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = p$.

Since $X = X^2$ for this random variable, $E[X^2] = p$ also.


By equation (16.9),


\[ \mathrm{Var}(X) = p - p^2 = p(1 - p). \tag{16.12} \]


Example 16.10 (Variance of a binomial random variable). Let $S_n$ be
the number of successes in $n$ tosses of a coin, when the coin has success
probability $p$. Using equation (9.3), we will show that

\[ \mathrm{Var}(S_n) = n(1 - p)p. \tag{16.13} \]

Lemma 12.11 in Chapter 12 is a typical application of equation (16.13), as
the reader can now verify.
Our use of additivity in Section 10.5.1 suggests that hammering away with
equation (9.3) is not the easiest way to find $\mathrm{Var}(S_n)$.
So you may want to leave the justification of equation (16.13) until
Exercise 16.16. The proof of Lemma 12.11 already used the method of
Exercise 16.16.

But for those who are interested, here’s the algebraic calculation for
justifying equation (16.13).
In Exercise {10.9} we found 


using algebraic manipulations. We will extend that method here. 
By Theorem 


E|S,| =, 


E [32] = 50 k?P(S, = &). 
k=0 

Looking at the solution for Exercise it seemed to depend on can- 
celling out the factor k from k!. But here we seem to need to remove a factor 
k? from k!. It’s not clear how to do that. 

After some meditation, we decide to find a related quantity, namely
$E[S_n(S_n-1)]$.

By Theorem

$$E[S_n(S_n-1)] = \sum_{k=0}^{n} k(k-1)P(S_n = k) = \sum_{k=2}^{n} k(k-1)\frac{n!}{k!\,(n-k)!}\,p^k(1-p)^{n-k},$$

since the k = 0 and k = 1 terms are zero. So

$$E[S_n(S_n-1)] = \sum_{k=2}^{n}\frac{n!}{(k-2)!\,(n-k)!}\,p^k(1-p)^{n-k} = p^2 n(n-1)\sum_{k=2}^{n}\frac{(n-2)!}{(k-2)!\,(n-k)!}\,p^{k-2}(1-p)^{n-k}.$$

We notice that $(n-k)! = ((n-2)-(k-2))!$. This suggests replacing k − 2
by j in the sum. We obtain

$$E[S_n(S_n-1)] = p^2 n(n-1)\sum_{j=0}^{n-2}\frac{(n-2)!}{j!\,((n-2)-j)!}\,p^{j}(1-p)^{(n-2)-j} = p^2 n(n-1)\sum_{j=0}^{n-2}P(S_{n-2} = j).$$

Since the range of $S_{n-2}$ consists of the numbers j = 0, 1, ..., n − 2, we know
that

$$\sum_{j=0}^{n-2}P(S_{n-2} = j) = 1.$$

Hence $E[S_n(S_n-1)] = p^2 n(n-1)$. That is, $E[S_n^2] - E[S_n] = p^2 n(n-1)$. So

$$E[S_n^2] = E[S_n] + p^2 n(n-1) = pn + p^2n^2 - p^2 n = p(1-p)n + p^2n^2.$$

By equation (16.9),

$$\operatorname{Var}(S_n) = E[S_n^2] - (E[S_n])^2 = p(1-p)n.$$
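
The calculation above can be checked numerically. The following is a minimal sketch, not part of the original text: it computes $E[S_n]$, $E[S_n(S_n-1)]$ and the variance directly from the binomial probabilities, using exact fractions. The function name `binomial_moments` and the sample values n = 7, p = 1/3 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def binomial_moments(n, p):
    """Return (E[S], E[S(S-1)], Var(S)) for S ~ Binomial(n, p), summed from the pmf."""
    q = 1 - p
    pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * pmf[k] for k in range(n + 1))
    factorial2 = sum(k * (k - 1) * pmf[k] for k in range(n + 1))
    second_moment = factorial2 + mean  # E[S^2] = E[S(S-1)] + E[S]
    return mean, factorial2, second_moment - mean**2


# Illustrative values (an assumption, chosen only for the check).
n, p = 7, Fraction(1, 3)
mean, factorial2, variance = binomial_moments(n, p)
print(mean, factorial2, variance)
print(n * p, p**2 * n * (n - 1), n * p * (1 - p))  # closed forms from the text
```

The two printed lines should agree, since the second line evaluates the closed forms np, p²n(n − 1) and np(1 − p) derived above.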


Example 16.11 (Variance of a geometric distribution). Let T be the
time of first success in an infinite sequence of Bernoulli trials, when the
success probability on each trial is p. As usual, q = 1 − p.

T is defined in section . We will find $\operatorname{Var}(T)$.

First note that by equation 13.18,

$$E[T] = \frac{1}{p}.$$

To get more information, we’re going to use the trick of differentiating a
series term-by-term.

We used that trick (for a finite series) in one of the derivations of the
expected value of $T_n$ (see equation (13.8)).

In Exercise 13.4 we used the differentiation trick as one method to find
$E[T]$.

Now we want to use the same trick here, to get $E[T^2]$.

Using the formula for the sum of a geometric series, we know that for
$x \in (-1,1)$,

$$\frac{1}{1-x} = 1 + x + x^2 + x^3 + x^4 + x^5 + \dots$$

Differentiating term-by-term,

$$\frac{1}{(1-x)^2} = 1 + 2x + 3x^2 + 4x^3 + 5x^4 + \dots$$

Differentiating term-by-term again, we have

$$\frac{2}{(1-x)^3} = 2 + 3\cdot 2x + 4\cdot 3x^2 + 5\cdot 4x^3 + \dots$$

Setting x = q we have

$$\frac{2}{p^3} = 2\cdot 1 + 3\cdot 2q + 4\cdot 3q^2 + 5\cdot 4q^3 + \dots = \sum_{k=2}^{\infty} k(k-1)q^{k-2}.$$

Then

$$\frac{2}{p^3} = \sum_{k=1}^{\infty} k(k-1)q^{k-2},$$

because the first term of this series is zero. Multiplying by pq,

$$\frac{2q}{p^2} = \sum_{k=1}^{\infty} k(k-1)q^{k-1}p = \sum_{k=1}^{\infty} k(k-1)P(T = k).$$

By Theorem 14.11 (which is the generalization of Theorem to the
countable-range case),

$$\frac{2q}{p^2} = E[T(T-1)] = E[T^2 - T] = E[T^2] - E[T] = E[T^2] - \frac{1}{p}.$$

Thus

$$E[T^2] = \frac{2q}{p^2} + \frac{1}{p}.$$

By equation (16.9),

$$\operatorname{Var}(T) = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{2q + p - 1}{p^2} = \frac{q}{p^2}.$$
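
As a sanity check on the series manipulation, one can sum the first few hundred terms of the series numerically. This is only a sketch under stated assumptions: the function name `geometric_partial_moments`, the choice p = 0.25 and the cutoff of 400 terms are illustrative, and floating-point arithmetic is used, so agreement is approximate.

```python
def geometric_partial_moments(p, terms):
    """Partial sums of E[T] and E[T(T-1)] using P(T = k) = q**(k-1) * p for k = 1..terms."""
    q = 1 - p
    mean = sum(k * q**(k - 1) * p for k in range(1, terms + 1))
    factorial2 = sum(k * (k - 1) * q**(k - 1) * p for k in range(1, terms + 1))
    return mean, factorial2


# Illustrative success probability (an assumption).
p = 0.25
q = 1 - p
mean, factorial2 = geometric_partial_moments(p, terms=400)
variance = factorial2 + mean - mean**2  # E[T^2] - (E[T])^2, with E[T^2] = E[T(T-1)] + E[T]

print(mean, 1 / p)                # compare with E[T] = 1/p
print(factorial2, 2 * q / p**2)   # compare with E[T(T-1)] = 2q/p^2
print(variance, q / p**2)         # compare with Var(T) = q/p^2
```

The tail of the series beyond 400 terms is negligible for this p, so each pair of printed numbers should agree to many decimal places.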


Exercise 16.7 (Variance of an exponential waiting time). Let T be
the exponential waiting time (Definition 15.1). Find $\operatorname{Var}(T)$.

Exercise 16.8 (Shifting preserves variance). Prove that for any real
number c,

$$\operatorname{Var}(X + c) = \operatorname{Var}(X). \tag{16.14}$$

For every nonnegative integer n, by linearity

$$E[(cX)^n] = c^n E[X^n], \qquad E[|cX|^n] = |c|^n E[|X|^n]. \tag{16.15}$$

Exercise 16.9 (Scaling the variance). Show that

$$\operatorname{Var}(cX) = |c|^2 \operatorname{Var}(X). \tag{16.16}$$


Which is the more meaningful measure of deviation: $E[|X-\mu|]$ or the
variance, $E[(X-\mu)^2]$?

Either one can be larger than the other. Which is more significant in 
a practical situation may depend on whether you consider that a few large 
deviations should be regarded as more important than a large number of 
smaller deviations. 

We will see later that Var (X) is the most useful measure of deviation 
for theoretical purposes. 


Example 16.12 (A case where $E[X^2] = (E[|X|])^2$). Figures 16.1 and 16.2
show examples of random variables X, Y on Ω = [0,1]. The probability on
Ω is assumed to be uniform.

You can check that $E[|X|] = E[|Y|]$, $E[X^2] = (E[|X|])^2$, and
$E[Y^2] > (E[|Y|])^2$.

X is such that equality holds in equation (16.10). Notice that |X| is
constant. Using equation (16.9) and Remark you can show that this is
not a coincidence.
[Figure: graph of X on [0, 1], with the vertical axis running from −2.0 to 2.0.]
Figure 16.1: Unusual case: square of centered absolute first moment equals
the variance.

Exercise 16.10. Prove that

$$\operatorname{Var}(X) - \operatorname{Var}(|X|) = (E[|X|])^2 - (E[X])^2. \tag{16.17}$$

We showed long ago, in equation (10.31), that

$$|E[X]| \le E[|X|].$$

Using this fact and equation (16.17) gives us another inequality:

$$\operatorname{Var}(|X|) \le \operatorname{Var}(X). \tag{16.18}$$


Exercise 16.11 (A minimum property for the variance). Let X be
any random variable such that $E[X^2]$ exists, and let c be any real number.
Let $\mu = E[X]$. After writing $X - c$ as $(X - \mu) + (\mu - c)$, show that

$$E[(X-c)^2] = \operatorname{Var}(X) + (\mu - c)^2. \tag{16.19}$$
[Figure: graph of Y on [0, 1], with horizontal axis ticks at 0.2, 0.4, 0.6, 0.8, 1.0.]
Figure 16.2: Typical case: square of centered absolute first moment less than
variance.


Notice that equation (16.19) tells us that the mean square deviation of X
from c is smallest when $c = \mu$. If you must describe the distribution of X by
a single number, this suggests that $\mu$ is the best choice.


Exercise 16.12. Let $X_1$ and $X_2$ be independent random variables with the
same distribution. Suppose that $E[X_1^2]$ exists. Prove that

$$E[(X_1 - X_2)^2] = 2\operatorname{Var}(X_1). \tag{16.20}$$


16.3 The Chebyshev Inequality


We can use the Markov inequality (Lemma 12.12) to estimate the probability
that a random variable deviates from its mean, as follows. 
Lemma 16.13 (Chebyshev’s Inequality). Let X be a random variable
such that the mean and variance of X exist. Let $\mu = E[X]$ and let
$\sigma = \sqrt{\operatorname{Var}(X)}$. Then for any real number a > 0,

$$P(|X - E[X]| \ge a) \le \frac{\operatorname{Var}(X)}{a^2}. \tag{16.21}$$

Proof. Let Y be the square of the deviation from the mean, i.e.
$Y = (X - E[X])^2$. By the Markov inequality,

$$a^2 P(Y \ge a^2) \le E[Y].$$

Since $\{Y \ge a^2\} = \{|X - E[X]| \ge a\}$ and $E[Y] = E[(X - E[X])^2]$, this
gives equation (16.21).


Remark 16.14 (Chebyshev and the search for Charlie). Now that
we have stated the Chebyshev inequality, we can see that Exercise is a
typical application of that inequality.

Recall that in the solution to Exercise we applied the Markov
inequality to obtain an estimate for $P(|S_n| \ge 500)$.

This is the same as estimating $P(S_n^2 \ge 250000)$, and using the Markov
inequality we had:

$$P(|S_n| \ge 500) = P(S_n^2 \ge 250000) \le \frac{E[S_n^2]}{250000}. \tag{16.22}$$

Let $\mu = E[S_n]$. In this problem $E[S_n] = 0$, so $\operatorname{Var}(S_n) = E[S_n^2]$, and
$|S_n - \mu| = |S_n|$.

Thus equation (16.22) is exactly the estimate given by the Chebyshev
inequality:

$$P(|S_n - \mu| \ge 500) \le \frac{\operatorname{Var}(S_n)}{500^2}.$$
With the usual notation of $\mu$ for $E[X]$ and $\sigma$ for the standard deviation
of X, $\sigma^2$ is the variance of X. One often writes Chebyshev’s inequality as

$$P(|X - \mu| \ge a) \le \frac{\sigma^2}{a^2}. \tag{16.23}$$

Equation (16.23) suggests that it might be useful to measure deviation
from the mean in units of the standard deviation $\sigma$. Thus we can rephrase
the Chebyshev inequality as:

$$P(|X - \mu| \ge c\sigma) \le \frac{1}{c^2}. \tag{16.24}$$

Note that the estimate in this equation does not depend on the value of $\sigma$.
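
To get a feeling for equation (16.24), here is a small sketch comparing the Chebyshev bound with the exact probability for one roll of a fair die. The choice of the die, the helper names `exact_tail` and `chebyshev_bound`, and the sample values of c are assumptions made purely for illustration.

```python
from fractions import Fraction
from math import sqrt

# One roll of a fair die: values 1..6, each with probability 1/6 (illustrative example).
values = range(1, 7)
prob = Fraction(1, 6)

mu = sum(v * prob for v in values)               # mean
var = sum((v - mu) ** 2 * prob for v in values)  # variance
sigma = sqrt(var)                                # standard deviation (float)


def exact_tail(c):
    """Exact P(|X - mu| >= c * sigma) for the die roll."""
    return sum(prob for v in values if abs(v - mu) >= c * sigma)


def chebyshev_bound(c):
    """Right-hand side of the Chebyshev inequality in the form (16.24)."""
    return 1 / c**2


for c in (1.2, 1.5, 2.0):
    print(c, float(exact_tail(c)), chebyshev_bound(c))
```

In each printed row the exact probability should not exceed the bound; for a distribution like this one the bound is typically far from tight.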


Exercise 16.13. Let W be a random variable such that P(W = 3) = 1/18, 
P(W = —3) = 1/18, and P(W = 0) = 8/9. Find the mean and standard 
deviation of W. Find the probability that W deviates from its mean by at 
least three standard deviations. 

Note that for this particular random variable, the probability you found 
is not very small. 

Compare your answer with the estimate obtained using the Chebyshev 
inequality. 

Now repeat these steps for the probability that W deviates from its mean
by at least 3.1 standard deviations.


Exercise 16.14. Let Y be a random variable on [−2, 2] defined by Y(t) = t.
With P equal to the uniform distribution on [−2, 2], find the mean, variance
and standard deviation of Y.

Also find the probability that Y deviates from its mean by at least three 
standard deviations. 
16.4 Covariance of two real-valued random variables


We observed in Section 16.1 that calculating the moments of a random
variable can help us to understand its distribution. When dealing with two
random variables X, Y, we can of course calculate the means, moments and
centered moments of X and Y separately. But it is also useful to have
quantities which tell us about the relation between X and Y. One such quantity
is $E[XY]$. For example, if $E[XY] \ne E[X]E[Y]$ we at least know that X
and Y are not independent.

We can learn more by calculating the mean of the product of the centered 
random variables, which is referred to as the covariance of X,Y. 


Definition 16.15 (Covariance of two random variables). For any ran- 
dom variables X,Y, the covariance of X,Y is denoted by Cov (X,Y), and 
is defined by 


$$\operatorname{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])], \tag{16.25}$$

provided that the expected values exist.
As usual, if $E[X] = \mu$ and $E[Y] = \nu$, we can write

$$\operatorname{Cov}(X,Y) = E[(X - \mu)(Y - \nu)]. \tag{16.26}$$


Remark 16.16 (Existence of Cov (X, Y)). The comparison principle can
be used to show that $\operatorname{Cov}(X,Y)$ exists if $E[X]$, $E[Y]$ and $E[XY]$ exist.
By Lemma 16.4, $E[XY]$ exists if $E[X^2]$ exists and $E[Y^2]$ exists.
Assuming $E[X]$ and $E[Y]$ exist, $E[X^2]$ exists if and only if $E[(X-\mu)^2]$
exists. Thus $\operatorname{Cov}(X,Y)$ exists if $\operatorname{Var}(X)$ exists and $\operatorname{Var}(Y)$ exists.


Since

$$\operatorname{Cov}(X,Y) = E[(X-\mu)(Y-\nu)] = E[X(Y-\nu)] - \mu E[Y-\nu],$$

and $E[Y - \nu] = 0$, we have

$$\operatorname{Cov}(X,Y) = E[X(Y-\nu)]. \tag{16.27}$$

Similarly

$$\operatorname{Cov}(X,Y) = E[(X-\mu)Y]. \tag{16.28}$$

Thus it is only necessary to center one of the two random variables when
calculating the covariance.

From the definition of covariance,

$$\operatorname{Cov}(X,X) = \operatorname{Var}(X). \tag{16.29}$$

Much as in Exercise 16.8, shifting random variables by constants has no
effect on covariance:

$$\operatorname{Cov}(X - a, Y - b) = \operatorname{Cov}(X,Y). \tag{16.30}$$

Writing variance as mean square minus square mean is often useful.
Covariance has a similar property.


Exercise 16.15 (Mean product minus product mean). Generalize
Exercise 16.4 by proving that

$$\operatorname{Cov}(X,Y) = E[XY] - E[X]E[Y]. \tag{16.31}$$


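Lemma 16.17 (Expanding the variance of a sum). Let X and Y be
any random variables such that $E[X]$, $E[Y]$ exist. If $\operatorname{Var}(X)$, $\operatorname{Var}(Y)$ exist,
then $\operatorname{Var}(X+Y)$ exists. Furthermore $\operatorname{Cov}(X,Y)$ exists, and

$$\operatorname{Var}(X+Y) = \operatorname{Var}(X) + 2\operatorname{Cov}(X,Y) + \operatorname{Var}(Y). \tag{16.32}$$

Thus $2\operatorname{Cov}(X,Y)$ is the “cross term” in the expansion of $\operatorname{Var}(X+Y)$.


Proof. We have already discussed existence.
The algebra is routine:

$$\begin{aligned}
\operatorname{Var}(X+Y) &= E[(X+Y-(\mu+\nu))^2] = E[((X-\mu)+(Y-\nu))^2] \\
&= E[(X-\mu)^2] + 2E[(X-\mu)(Y-\nu)] + E[(Y-\nu)^2] \\
&= \operatorname{Var}(X) + 2E[(X-\mu)(Y-\nu)] + \operatorname{Var}(Y) \\
&= \operatorname{Var}(X) + 2\operatorname{Cov}(X,Y) + \operatorname{Var}(Y).
\end{aligned} \tag{16.33}$$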
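
The expansion can be confirmed on a concrete example. The sketch below uses a small joint distribution that is purely hypothetical (it does not come from the text), and the helper name `expect` is likewise an assumption; exact fractions avoid rounding questions.

```python
from fractions import Fraction

# Hypothetical joint distribution of (X, Y): keys are (x, y) pairs, values are probabilities.
joint = {
    (0, 0): Fraction(1, 2),
    (1, 0): Fraction(1, 6),
    (1, 2): Fraction(1, 3),
}


def expect(f):
    """Expected value of f(X, Y) under the joint distribution."""
    return sum(prob * f(x, y) for (x, y), prob in joint.items())


mu = expect(lambda x, y: x)
nu = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu) ** 2)
var_y = expect(lambda x, y: (y - nu) ** 2)
cov_xy = expect(lambda x, y: (x - mu) * (y - nu))
var_sum = expect(lambda x, y: (x + y - mu - nu) ** 2)

print(var_sum, var_x + 2 * cov_xy + var_y)          # equation (16.32)
print(cov_xy, expect(lambda x, y: x * y) - mu * nu)  # equation (16.31)
```

Here X and Y are dependent, so the cross term $2\operatorname{Cov}(X,Y)$ is nonzero and cannot be dropped.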
The random variables X, Y in equation (16.32) could be independent. In
that case their covariance is zero.


Lemma 16.18 (Independence implies zero covariance). Let X, Y be
any independent random variables such that $E[X]$ exists and $E[Y]$ exists.
Then $\operatorname{Cov}(X,Y)$ exists, and $\operatorname{Cov}(X,Y) = 0$.


Proof. By additivity, $E[X - \mu]$ exists and $E[Y - \nu]$ exists, and these
expectations are zero.

By Lemma 12.7, $X - \mu$ and $Y - \nu$ are independent.

By Theorem 12.8, $E[(X-\mu)(Y-\nu)]$ exists, and
$E[(X-\mu)(Y-\nu)] = E[X-\mu]\,E[Y-\nu] = 0 \cdot 0 = 0$.

And by definition $\operatorname{Cov}(X,Y) = E[(X-\mu)(Y-\nu)]$.


Corollary 16.19 (Additivity of variance for independent). Let X and 
Y be independent random variables whose variances exist. Then the variance 
of X + Y exists, and Var (X + Y) = Var (X) + Var (Y). 


Proof. Apply Lemma 16.18 to equation (16.32).


Life would be simpler if the converse to Corollary 16.19 were true.
However, it ain’t.


Example 16.20. Consider tossing a fair coin twice, and then rolling a fair
die.

Let G = 1 if the first toss gives success, and G = −1 otherwise.

Let H = 1 if the second toss gives success, and H = −1 otherwise.

Let K be the result of rolling the fair die, so P(K = i) = 1/6 for
i = 1, ..., 6.

It is clear physically that G, H, K is an independent sequence of random
variables.

Notice that

$$P(GK = 6) = P(G = 1)P(K = 6) = \frac{1}{12}.$$

Similarly

$$P(HK = 1) = P(H = 1)P(K = 1) = \frac{1}{12}.$$

However,

$$P(HK = 1 \mid GK = 6) = 0,$$

since the result of the roll of the die cannot be both equal to 6 and equal to
1.

Thus GK and HK are not independent.

On the other hand, $E[GK] = E[G]E[K] = 0$, and $E[HK] = E[H]E[K] = 0$, so

$$\operatorname{Cov}(GK, HK) = E[GHK^2] = E[GH]\,E[K^2] = E[G]\,E[H]\,E[K^2] = 0.$$
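
Since the sample space in Example 16.20 has only 24 equally likely outcomes, the claims can be verified by brute-force enumeration. This is a sketch; the helper names `P` and `E` are assumptions introduced for the illustration.

```python
from fractions import Fraction
from itertools import product

# All outcomes (G, H, K): two fair coin tosses coded as +1/-1, then a fair die.
outcomes = list(product([1, -1], [1, -1], range(1, 7)))
prob = Fraction(1, len(outcomes))  # 24 equally likely outcomes


def P(event):
    """Probability of the set of outcomes where event(g, h, k) is true."""
    return sum(prob for g, h, k in outcomes if event(g, h, k))


def E(f):
    """Expected value of f(G, H, K)."""
    return sum(prob * f(g, h, k) for g, h, k in outcomes)


cov = E(lambda g, h, k: (g * k) * (h * k)) - E(lambda g, h, k: g * k) * E(lambda g, h, k: h * k)
p_gk6 = P(lambda g, h, k: g * k == 6)
p_hk1 = P(lambda g, h, k: h * k == 1)
p_both = P(lambda g, h, k: g * k == 6 and h * k == 1)

print("Cov(GK, HK) =", cov)
print("P(GK=6) P(HK=1) =", p_gk6 * p_hk1, " but P(GK=6, HK=1) =", p_both)
```

The covariance is zero, yet the joint probability differs from the product of the marginal probabilities, so GK and HK are uncorrelated without being independent.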


Although covariance zero does not imply independence, we often think 
of covariance as a rough measure of the degree of dependence between two 
random variables. 

Here’s some terminology. 


Definition 16.21 (Uncorrelated random variables). If Cov (X,Y) ex- 
ists and Cov (X,Y) =0 then X,Y are said to be uncorrelated. 


The word “uncorrelated” is also used in ordinary language, with a less 
precise meaning. As usual, one must judge what is meant from the context. 


The next lemma extends Lemma 16.17, with the same proof.


Lemma 16.22 (Expanding the variance of a sum of n random
variables). Let $X_1, \dots, X_n$ be real-valued random variables. Assume that the
mean $\mu_i$ of each $X_i$ exists, and the variance $\operatorname{Var}(X_i)$ of each $X_i$ exists. Let
$S_n = X_1 + \dots + X_n$.

Then $\operatorname{Var}(S_n)$ exists, and

$$\operatorname{Var}(S_n) = \sum_{i=1}^{n}\sum_{j=1}^{n} \operatorname{Cov}(X_i, X_j). \tag{16.34}$$

Of course $\operatorname{Cov}(X_i, X_i) = \operatorname{Var}(X_i)$ for each i (equation (16.29)).
If it happens that $X_1, \dots, X_n$ is an independent sequence, then
$\operatorname{Cov}(X_i, X_j) = 0$ for all $i \ne j$, and

$$\operatorname{Var}(S_n) = \operatorname{Var}(X_1) + \dots + \operatorname{Var}(X_n). \tag{16.35}$$


Exercise 16.16 (Variance of a binomial random variable revisited).
Use equation (16.35) to prove equation (16.13) in an efficient manner.

As before, start by writing $S_n = X_1 + \dots + X_n$.

Your goal is to show that

$$\operatorname{Var}(S_n) = n(1-p)p. \tag{16.36}$$


16.5 Using Chebyshev 


In order to use Chebyshev’s inequality for a random variable, we need to
know the variance of the random variable. The second moment of a random
variable involves the square of the random variable, so calculating a variance
is unlikely to be as easy as calculating a mean. However, equation
helps with that.

You will remember that Lemma 12.11 showed us how to calculate $E[S_n^2]$
for random walk. In that lemma, $E[X_i] = 0$ for each i, so $E[S_n] = 0$ and
$E[S_n^2] = \operatorname{Var}(S_n)$. Also we had $\operatorname{Var}(X_i) = 1$ for each i, and we obtained
equation using exactly the argument that was used to get equation
(16.35).


16.6 The Weak Law of Large Numbers


We mentioned earlier that the frequency interpretation of probability does
not tell us how many repetitions of an experiment are likely to be needed in
order to estimate a probability value using an average value. More
generally, the frequency interpretation of expected value (Fact ) has the same
deficiency. Theorem in this section allows us to make these frequency
statements a little more precise.

Let X be a mathematical random variable which represents some property
of an experiment. Let $X_1, \dots, X_n$ be independent random variables with the
same distribution as X, and let $S_n = X_1 + \dots + X_n$. Then $S_n/n$ represents the
average measured value for the property in n repetitions of the experiment.
We would like to know whether $E[X]$ is a reliable estimate for the average
measured value of the property. In other words, how likely is it that $S_n/n$
deviates significantly from $E[X]$?

The next theorem attempts to answer this question, with “significantly”
interpreted as “by more than $\varepsilon$” and “likely” expressed as a probability.


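Theorem 16.23 (The Weak Law of Large Numbers). Let $X_1, \dots, X_n$
be independent random variables on some sample space, such that each $X_i$
has the same mean $\mu$ and the same standard deviation $\sigma$. Let
$S_n = X_1 + \dots + X_n$. Then for any real number $\varepsilon > 0$,

$$P\left(\left|\frac{S_n}{n} - \mu\right| \ge \varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2}. \tag{16.37}$$

Proof. For each i, $\operatorname{Var}(X_i) = \sigma^2$.
By equation (16.35), $\operatorname{Var}(S_n) = n\sigma^2$.

Thus $\operatorname{Var}(S_n/n) = (1/n^2)\operatorname{Var}(S_n) = \sigma^2/n$ (using equation (16.16)).
By linearity, $E[S_n/n] = \mu$.
Chebyshev’s inequality (equation (16.21)) then gives equation (16.37).


When using equation (16.37), it is important to remember that $\sigma$ is
the standard deviation of each $X_i$, not the standard deviation of the random
variable $S_n/n$.

The variance and standard deviation of $S_n$ are given by

$$\operatorname{Var}(S_n) = n\sigma^2, \qquad \sqrt{\operatorname{Var}(S_n)} = \sqrt{n}\,\sigma, \tag{16.38}$$

so

$$\operatorname{Var}\left(\frac{S_n}{n}\right) = \frac{\sigma^2}{n}, \qquad \sqrt{\operatorname{Var}\left(\frac{S_n}{n}\right)} = \frac{\sigma}{\sqrt{n}}. \tag{16.39}$$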
As we have already observed, when the deviation of a random variable is
expressed as a multiple of the standard deviation of the random variable,
Chebyshev’s inequality does not depend on the size of the standard deviation.
Thus if we rewrite equation with the deviation expressed in units of
the standard deviation of $S_n/n$, the resulting estimate does not depend on
$\sigma$ or n:

$$P\left(\left|\frac{S_n}{n} - \mu\right| \ge a\frac{\sigma}{\sqrt{n}}\right) \le \frac{1}{a^2}, \tag{16.40}$$

or equivalently

$$P\left(|S_n - n\mu| \ge a\sqrt{n}\,\sigma\right) \le \frac{1}{a^2}. \tag{16.41}$$


The Weak Law of Large Numbers is important theoretically, since it
removes some of the vagueness from the frequency interpretation of expected
value. For practical purposes, one usually finds that the bound for the
probability of error which is given in equation is not very precise, i.e. the
actual probability is considerably smaller.
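
To illustrate how loose the bound can be, the sketch below compares the Weak Law bound $\sigma^2/(n\varepsilon^2)$ with the exact probability for n tosses of a fair coin. The helper names and the sample values n = 100, ε = 1/10 are assumptions chosen only for this illustration.

```python
from fractions import Fraction
from math import comb


def weak_law_bound(sigma2, n, eps):
    """Right-hand side of the Weak Law estimate: sigma^2 / (n * eps^2)."""
    return sigma2 / (n * eps**2)


def exact_deviation_prob(n, p, eps):
    """Exact P(|S_n/n - p| >= eps) for S_n ~ Binomial(n, p), summed from the pmf."""
    q = 1 - p
    return sum(
        comb(n, k) * p**k * q**(n - k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - p) >= eps
    )


# Illustrative values (assumptions): 100 tosses of a fair coin, deviation 1/10.
n, p, eps = 100, Fraction(1, 2), Fraction(1, 10)
bound = weak_law_bound(p * (1 - p), n, eps)  # each toss has variance p(1 - p)
exact = exact_deviation_prob(n, p, eps)

print("Weak Law bound:   ", float(bound))
print("exact probability:", float(exact))
```

For these values the exact probability comes out well below the bound, in line with the remark that the estimate is usually not very precise.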

We mentioned earlier that there is another mathematical law of large 
numbers, called the Strong Law of Large Numbers. From the name one 
might hope that the Strong Law would give a better estimate. Unfortunately, 
although the Strong Law is indeed stronger for theoretical purposes, it does 
nothing to improve the probability estimate. We need another approach, 
such as the Central Limit Theorem ([1]).


16.7 Bilinearity of covariance 


If you wish to calculate the covariance of random variables which are given 
as algebraic expressions in terms of other random variables, you can always 
use the definition of covariance in terms of products. However, it may be 
simpler to use the algebraic properties of covariance directly. 

The general concept of a linear operation was defined in Definition 10.12.
An operation is linear if one can take sums “through” the operation, and one
can also take multiplication by a constant through the operation. Now we 
introduce the general concept of a bilinear operation. 


Definition 16.24 (Bilinear operations). A bilinear operation is an oper- 
ation on two elements x and y which depends linearly on x when y is fixed, 
and depends linearly on y when x is fixed. 
Covariance is defined in terms of expectations of products, and it is
bilinear in the sense of Definition 16.24. That is, Cov (X,Y) is a linear function
of X when Y is held fixed, and Cov (X,Y) is a linear function of Y when X 
is held fixed. We state this formally in the next lemma. 


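Lemma 16.25 (Covariance is a bilinear function). Covariance is
bilinear, so that for any random variables X, Y, Z, and any numbers $c_1, c_2$,

$$\begin{aligned}
\operatorname{Cov}(c_1X + c_2Y, Z) &= c_1\operatorname{Cov}(X,Z) + c_2\operatorname{Cov}(Y,Z), \text{ and} \\
\operatorname{Cov}(Z, c_1X + c_2Y) &= c_1\operatorname{Cov}(Z,X) + c_2\operatorname{Cov}(Z,Y).
\end{aligned} \tag{16.42}$$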


Of course the second equality in equation (16.42) is redundant, since 
covariance is clearly a symmetric operation: 


Cov (X,Y) = Cov (Y,X) (16.43) 


Exercise 16.17. Prove the first equality in equation (16.42). 


Multiplication of two numbers is a simple example of a bilinear operation. 
The “bilinear” property in this case is just another way of describing the 
distributive law. Our familiarity with the algebra of numbers makes it easy 
for us to use the bilinear property for other operations, such as covariance. 


Example 16.26. Let’s look at the algebra needed to derive equation (16.34).
Using the definition,

$$\operatorname{Var}(S_n) = E\left[\left(\sum_{i=1}^{n}(X_i - \mu_i)\right)^2\right].$$

Applying the distributive law,

$$\operatorname{Var}(S_n) = E\left[\sum_{i=1}^{n}\sum_{j=1}^{n}(X_i - \mu_i)(X_j - \mu_j)\right] = \sum_{i=1}^{n}\sum_{j=1}^{n}E[(X_i - \mu_i)(X_j - \mu_j)],$$

so

$$\operatorname{Var}(S_n) = \sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}(X_i, X_j).$$

If we want to make use of the bilinear property of covariance, we would write
the same manipulations a bit differently.

$$\operatorname{Var}(S_n) = \operatorname{Cov}(S_n, S_n) = \operatorname{Cov}\left(\sum_{i=1}^{n}X_i, \sum_{j=1}^{n}X_j\right).$$

Then, using bilinearity, again we arrive at equation (16.34):

$$\operatorname{Var}(S_n) = \sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}(X_i, X_j).$$

Using bilinearity is a bit shorter, but not dramatically so. Still, you will likely
find that de-cluttering equations helps to clarify your work.
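
In computational terms, bilinearity says that the covariance of two linear combinations is a double sum over the table of pairwise covariances. The sketch below illustrates this; the function name `cov_of_combinations` and the covariance table are hypothetical and chosen only for the example.

```python
def cov_of_combinations(a, b, cov):
    """Cov(sum_i a[i] X_i, sum_j b[j] X_j) = sum_i sum_j a[i] * b[j] * Cov(X_i, X_j)."""
    n = len(a)
    return sum(a[i] * b[j] * cov[i][j] for i in range(n) for j in range(n))


# Hypothetical table of covariances for (X_1, X_2):
# Var(X_1) = 4, Var(X_2) = 9, Cov(X_1, X_2) = -2.
cov = [[4, -2],
       [-2, 9]]

# Var(X_1 + X_2) = Cov(X_1 + X_2, X_1 + X_2), as in equation (16.34).
var_sum = cov_of_combinations([1, 1], [1, 1], cov)

# Cov(2 X_1 - X_2, X_1 + 3 X_2), expanded term by term.
mixed = cov_of_combinations([2, -1], [1, 3], cov)

print(var_sum, mixed)
```

The first call reproduces $\operatorname{Var}(X_1) + 2\operatorname{Cov}(X_1,X_2) + \operatorname{Var}(X_2)$; the second expands into four terms exactly as the distributive law would for a product of numbers.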


Exercise 16.18. Let X,Y,Z be independent random variables, such that 
Var (X) = 1, Var (Y) = 2 and Var (Z) = 3. 
Calculate Cov (5X —Y+2Z,X +3Y — Z). 


Example 16.27 (Variance of a hypergeometric random variable).
Let $L_{N,K,n}$ be the random variable defined in Definition , so that $L_{N,K,n}$
has a hypergeometric distribution with parameters N, K, n.

The experiment consists of randomly selecting n objects from a set S of N
objects, where a certain target set T of K objects has been specified. $L_{N,K,n}$
is the number of selected objects which lie in the target set T.

We have already dealt with $E[L_{N,K,n}]$ in Section 10.5.2, using Method 1
of that section.
We found that

$$E[L_{N,K,n}] = n\frac{K}{N}.$$

Consider the case that n = 1 in this equation. Since $L_{N,K,1}$ is either zero
or one, $K/N$ is also the probability that when a single element is randomly
selected, it will lie in the target set T. Let us denote this probability by p.
We will say it is the success probability when choosing a single element.


In the present example we wish to calculate $\operatorname{Var}(L_{N,K,n})$ for all n. The
properties of the covariance function will allow us to do that fairly efficiently.

For each member $\sigma$ of S, let $X_\sigma$ be equal to 1 if $\sigma$ is in the selected subset
of n elements, and $X_\sigma = 0$ otherwise. By symmetry, $E[X_\sigma]$ is the same for all
$\sigma$ and $\operatorname{Var}(X_\sigma)$ is the same for all $\sigma$. Let $E[X_\sigma] = \mu$ and let $\operatorname{Var}(X_\sigma) = v$.
Of course $\operatorname{Cov}(X_\sigma, X_\sigma) = \operatorname{Var}(X_\sigma) = v$.

Let

$$Z = \sum_{\sigma \in S} X_\sigma.$$

From the description of the experiment, Z is constant and Z = n. Hence
$E[Z] = n$ (and $\operatorname{Var}(Z) = 0$).
By additivity, $E[Z] = N\mu$, so $N\mu = n$, so

$$\mu = \frac{n}{N}. \tag{16.44}$$


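Since the possible values of $X_\sigma$ are 0 and 1, we see that $P(X_\sigma = 1) = \mu$
for each $\sigma$.
Since Z is constant, $\operatorname{Var}(Z) = 0$, i.e. $\operatorname{Cov}(Z, Z) = 0$:

$$\operatorname{Cov}\left(\sum_{\sigma \in S} X_\sigma, \sum_{\tau \in S} X_\tau\right) = 0.$$

Expanding using bilinearity, this gives

$$\sum_{\sigma, \tau \in S} \operatorname{Cov}(X_\sigma, X_\tau) = 0.$$

Grouping like terms,

$$\sum_{\sigma \in S} \operatorname{Cov}(X_\sigma, X_\sigma) + \sum_{\sigma, \tau \in S,\ \sigma \ne \tau} \operatorname{Cov}(X_\sigma, X_\tau) = 0. \tag{16.45}$$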
Since $E[X_\sigma] = \mu$, we know that $P(X_\sigma = 1) = \mu$, so

$$v = \operatorname{Var}(X_\sigma) = E[X_\sigma^2] - (E[X_\sigma])^2 = \mu - \mu^2 = \mu(1-\mu),$$

as usual. Thus

$$\sum_{\sigma \in S} \operatorname{Cov}(X_\sigma, X_\sigma) = N\mu(1-\mu).$$

By symmetry, $\operatorname{Cov}(X_\sigma, X_\tau)$ is the same for any $\sigma, \tau$ with $\sigma \ne \tau$. Call this
number c. There are N(N − 1) choices for $\sigma, \tau$ with $\sigma \ne \tau$ (N ways to choose
$\sigma$, and then, for that $\sigma$, N − 1 ways to choose $\tau$). Hence we have

$$\sum_{\sigma, \tau \in S,\ \sigma \ne \tau} \operatorname{Cov}(X_\sigma, X_\tau) = N(N-1)c.$$

Substituting in equation (16.45),

$$N\mu(1-\mu) + N(N-1)c = 0,$$

and so

$$c = -\frac{\mu(1-\mu)}{N-1}. \tag{16.46}$$
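From the definitions,

$$L_{N,K,n} = \sum_{\sigma \in T} X_\sigma.$$

Since $\operatorname{Var}(L_{N,K,n}) = \operatorname{Cov}(L_{N,K,n}, L_{N,K,n})$, by expanding and grouping like
terms we have

$$\operatorname{Var}(L_{N,K,n}) = \sum_{\sigma \in T} \operatorname{Cov}(X_\sigma, X_\sigma) + \sum_{\sigma, \tau \in T,\ \tau \ne \sigma} \operatorname{Cov}(X_\sigma, X_\tau).$$

Thus

$$\begin{aligned}
\operatorname{Var}(L_{N,K,n}) &= K\mu(1-\mu) + K(K-1)c = K\mu(1-\mu) - K(K-1)\frac{\mu(1-\mu)}{N-1} \\
&= K\mu(1-\mu)\frac{N-K}{N-1} = \frac{n(N-n)}{N^2}\cdot\frac{K(N-K)}{N-1} \\
&= n\,\frac{K}{N}\cdot\frac{N-K}{N}\cdot\frac{N-n}{N-1} = np(1-p)\frac{N-n}{N-1}.
\end{aligned}$$

Thus we have

$$\operatorname{Var}(L_{N,K,n}) = np(1-p)\frac{N-n}{N-1}. \tag{16.47}$$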
Incidentally, this equation shows that for n > 1,

$$\operatorname{Var}(L_{N,K,n}) < np(1-p).$$

Notice that $np(1-p) = \operatorname{Var}(S_n)$, where $S_n$ is the number of successes in n
coin-tosses, when the coin has success probability p.

This gives an answer to the question posed at the end of the solution for
Exercise which asks why the graph of the hypergeometric distribution
should look narrower than the graph of the corresponding binomial distribution.
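
Equation (16.47) can be checked against a direct computation from the hypergeometric probabilities. The sketch below assumes the standard counting formula for those probabilities, $\binom{K}{k}\binom{N-K}{n-k}/\binom{N}{n}$; the function name and the sample values N = 20, K = 8, n = 5 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def hypergeometric_variance(N, K, n):
    """Variance of the number of target objects among n drawn without replacement."""
    total = comb(N, n)
    # Possible counts k of selected target objects.
    support = range(max(0, n - (N - K)), min(n, K) + 1)
    pmf = {k: Fraction(comb(K, k) * comb(N - K, n - k), total) for k in support}
    mean = sum(k * pk for k, pk in pmf.items())
    return sum((k - mean) ** 2 * pk for k, pk in pmf.items())


# Illustrative parameters (assumptions).
N, K, n = 20, 8, 5
p = Fraction(K, N)
direct = hypergeometric_variance(N, K, n)
formula = n * p * (1 - p) * Fraction(N - n, N - 1)  # equation (16.47)
binomial = n * p * (1 - p)                          # variance for n coin tosses

print(direct, formula, binomial)
```

The first two printed values should coincide, and both should be smaller than the binomial variance, which is the narrowing effect described above.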


Exercise 16.19 (Negative covariance). By equation (16.46), in
Example 16.27 we have

$$\operatorname{Cov}(X_\sigma, X_\tau) < 0$$

for any $\sigma \ne \tau$.
For simplicity, consider the case that N = 2n, so that $\mu = 1/2$ by equation
(16.44).

Then $X_\sigma - \mu = \pm 1/2$, and

$$P\left(X_\sigma - \mu = \tfrac{1}{2}\right) = P(X_\sigma = 1) = \tfrac{1}{2}.$$

Similarly $X_\tau - \mu = \pm 1/2$, and

$$P\left(X_\tau - \mu = \tfrac{1}{2}\right) = P(X_\tau = 1) = \tfrac{1}{2}.$$

Since $\operatorname{Cov}(X_\sigma, X_\tau) = E[(X_\sigma - \mu)(X_\tau - \mu)]$, it is easy to see that

$$\operatorname{Cov}(X_\sigma, X_\tau) = -(1/4)P(X_\sigma \ne X_\tau) + (1/4)P(X_\sigma = X_\tau). \tag{16.48}$$

For $\sigma \ne \tau$, we showed that $\operatorname{Cov}(X_\sigma, X_\tau) = c < 0$, so the equation
above shows that $P(X_\sigma \ne X_\tau) > P(X_\sigma = X_\tau)$.

In this problem try to give a more physical explanation for the fact that
$P(X_\sigma \ne X_\tau) > P(X_\sigma = X_\tau)$.
16.8 Solutions for Chapter
Solution (Exercise 16.1). (i) By equation (15.3),

$$E[X] = \int_0^1 \frac{1}{t}\,dt = \lim_{a \searrow 0}\int_a^1 \frac{1}{t}\,dt = \lim_{a \searrow 0} \log(t)\Big|_a^1 = \lim_{a \searrow 0}(\log 1 - \log a) = \infty.$$

Thus $E[X]$ does not exist.
(Remember that in our work, log means logarithm to the base e.)


(ii)

$$E[X] = \int_0^1 \frac{1}{\sqrt{t}}\,dt = \lim_{a \searrow 0}\int_a^1 \frac{1}{\sqrt{t}}\,dt = \lim_{a \searrow 0} 2\left(\sqrt{1} - \sqrt{a}\right) = 2.$$

Also

$$E[X^2] = \int_0^1 \frac{1}{t}\,dt = \lim_{a \searrow 0}\int_a^1 \frac{1}{t}\,dt = \lim_{a \searrow 0} \log t\,\Big|_a^1 = \lim_{a \searrow 0}(-\log a) = \infty.$$

Thus $E[X^2]$ does not exist.

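(iii) For any positive integer k,

$$E[X^k] = \sum_{n=1}^{\infty} n^k P(X = n),$$

and the terms of this series are a constant multiple of $n^{k-5}$.

By the integral test, this series converges if and only if
$\int_1^\infty x^{k-5}\,dx$ exists. For k < 4,

$$\int_1^\infty x^{k-5}\,dx = \lim_{b \to \infty}\int_1^b x^{k-5}\,dx = \lim_{b \to \infty}\frac{1}{4-k}\left(1 - \frac{1}{b^{4-k}}\right) = \frac{1}{4-k}.$$

Thus $E[X^k]$ exists for k < 4.
When k = 4,

$$\lim_{b \to \infty}\int_1^b \frac{dx}{x} = \lim_{b \to \infty} \log(x)\Big|_1^b = \lim_{b \to \infty}(\log(b) - 0) = \infty.$$

Thus $E[X^4]$ does not exist.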


Solution (Exercise 16.2). Since $(x-y)^2 \ge 0$, $x^2 - 2xy + y^2 \ge 0$.
Rearranging gives

$$2xy \le x^2 + y^2.$$

Also

$$(x+y)^2 = x^2 + y^2 + 2xy \le x^2 + y^2 + x^2 + y^2 = 2x^2 + 2y^2.$$


Solution (Exercise 16.3). First method. This is just the statement of
Lemma with n = 2.


Second method. By assumption, $E[X^2]$ exists.
Since 1 is a bounded random variable (it’s even finite-range), $E[1]$ exists.
By Lemma 16.4, with Y replaced by 1, $E[X]$ exists.


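Solution (Exercise 16.4).

$$\begin{aligned}
\operatorname{Var}(X) &= E[(X-\mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 \\
&= E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2.
\end{aligned}$$

Solution (Exercise 16.5).

$$E[X] = E[7] = 7.$$

$$\operatorname{Var}(X) = E[(X-7)^2] = E[(7-7)^2] = 0.$$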

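Solution (Exercise 16.6).

$$E[X] = \int_0^5 w\,\frac{1}{5}\,dw = \frac{1}{5}\cdot\frac{25}{2} = \frac{5}{2}, \qquad E[X^2] = \int_0^5 w^2\,\frac{1}{5}\,dw = \frac{1}{5}\cdot\frac{125}{3} = \frac{25}{3}.$$

By equation (16.9),

$$\operatorname{Var}(X) = \frac{25}{3} - \frac{25}{4} = \frac{25}{12}.$$

Now we generalize. Let X have a uniform distribution on [s, t]. An easy
calculation shows that $E[X]$ is the midpoint, i.e. $\mu = (s+t)/2$.

Also

$$\begin{aligned}
\operatorname{Var}(X) = E[(X-\mu)^2] &= \int_s^t (w-\mu)^2\,\frac{1}{t-s}\,dw \\
&= \frac{1}{t-s}\cdot\frac{1}{3}(w-\mu)^3\,\Big|_s^t \\
&= \frac{1}{3(t-s)}\left((t-\mu)^3 - (s-\mu)^3\right).
\end{aligned}$$

Of course $t - \mu = (t-s)/2$ and $s - \mu = -(t-s)/2$. Thus

$$\operatorname{Var}(X) = \frac{1}{3(t-s)}\cdot\frac{2(t-s)^3}{8} = \frac{(t-s)^2}{12}. \tag{16.49}$$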


Solution (Exercise 16.7). Let T have an exponential distribution with
parameter $\lambda$.

By equation (15.7),

$$E[T] = \frac{1}{\lambda}.$$

By the solution to Exercise

$$E[T^2] = \frac{2}{\lambda^2}.$$

Hence by equation (16.9),

$$\operatorname{Var}(T) = \frac{1}{\lambda^2}. \tag{16.50}$$


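Solution (Exercise 16.8). Let $\mu = E[X]$.

$$E[X - c] = E[X] - E[c] = \mu - c.$$

$$\operatorname{Var}(X - c) = E[((X-c) - (\mu-c))^2] = E[(X-\mu)^2] = \operatorname{Var}(X).$$

Since c is an arbitrary real number, replacing c by −c gives equation (16.14).


Solution (Exercise 16.9). Let $\mu = E[X]$.
By linearity,

$$E[cX] = c\mu,$$

and so

$$\operatorname{Var}(cX) = E[(cX - c\mu)^2] = E[c^2(X-\mu)^2] = c^2E[(X-\mu)^2] = c^2\operatorname{Var}(X).$$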
Solution (Exercise 16.10). By equation (16.9),

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2,$$

$$\operatorname{Var}(|X|) = E[|X|^2] - (E[|X|])^2 = E[X^2] - (E[|X|])^2.$$

Subtracting these equations gives equation (16.17).

Solution (Exercise 16.11).

$$E[(X-c)^2] = E[((X-\mu) + (\mu-c))^2] = E[(X-\mu)^2 + 2(X-\mu)(\mu-c) + (\mu-c)^2].$$

Since $E[2(X-\mu)(\mu-c)] = 2(\mu-c)E[X-\mu] = 0$,

$$E[(X-c)^2] = E[(X-\mu)^2] + (\mu-c)^2 = \operatorname{Var}(X) + (\mu-c)^2.$$

Solution (Exercise 16.12). Let $E[X_1] = \mu$. Using Theorem 12.8,

$$\begin{aligned}
E[(X_1 - X_2)^2] &= E[X_1^2 - 2X_1X_2 + X_2^2] \\
&= E[X_1^2] - 2E[X_1]E[X_2] + E[X_2^2] = 2(E[X_1^2] - \mu^2).
\end{aligned}$$

Apply equation (16.9).


Solution (Exercise 16.13).

$$E[W] = \frac{1}{18}\cdot 3 + \frac{1}{18}\cdot(-3) + \frac{8}{9}\cdot 0 = 0.$$

$$\operatorname{Var}(W) = E[W^2] = \frac{1}{18}\cdot 9 + \frac{1}{18}\cdot 9 = 1.$$

Thus the standard deviation $\sigma$ for this random variable is one.
The probability that W deviates from its mean by at least three standard
deviations is

$$P(|W| \ge 3\sigma) = P(|W| \ge 3) = P(W = 3) + P(W = -3) = \frac{1}{9}.$$

Using equation (16.24), the Chebyshev estimate is

$$P(|W| \ge 3\sigma) \le \frac{1}{9}.$$

A perfect estimate!
The probability that W deviates from its mean by at least 3.1 standard
deviations is

$$P(|W| \ge 3.1\sigma) = P(|W| \ge 3.1) = 0.$$

Using equation (16.24), the Chebyshev estimate is

$$P(|W| \ge 3.1\sigma) \le \frac{1}{9.61}.$$

Not so good.
Solution (Exercise 16.14). The mean of a uniform distribution on an
interval is the midpoint, so $\mu = E[Y] = 0$.

By equation (16.49), the variance of a uniform distribution on an interval
is one-twelfth of the square of the length, so $\operatorname{Var}(Y) = 16/12 = 4/3$, and
the standard deviation of Y is $\sigma = 2/\sqrt{3}$.

Notice that $3\sigma = 2\sqrt{3} > 2$.

Thus

$$P(|Y - \mu| \ge 3\sigma) = 0.$$


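Solution (Exercise 16.15).

$$\begin{aligned}
E[(X-\mu)(Y-\nu)] &= E[XY] - \mu E[Y] - \nu E[X] + \mu\nu \\
&= E[XY] - \mu\nu - \mu\nu + \mu\nu = E[XY] - \mu\nu.
\end{aligned}$$

Solution (Exercise 16.16). By equation (16.35),

$$\operatorname{Var}(S_n) = \operatorname{Var}(X_1) + \dots + \operatorname{Var}(X_n).$$

By equation (16.12), each term equals p(1 − p), so

$$\operatorname{Var}(S_n) = np(1-p),$$

in agreement with equation (16.13).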


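Solution (Exercise 16.17). We must prove that
$\operatorname{Cov}(c_1X + c_2Y, Z) = c_1\operatorname{Cov}(X,Z) + c_2\operatorname{Cov}(Y,Z)$.
Let $\gamma = E[Z]$. To save some writing we can use equation (16.27). By
equation (16.27),

$$\begin{aligned}
\operatorname{Cov}(c_1X + c_2Y, Z) &= E[(c_1X + c_2Y)(Z - \gamma)] \\
&= E[c_1X(Z-\gamma) + c_2Y(Z-\gamma)] \\
&= E[c_1X(Z-\gamma)] + E[c_2Y(Z-\gamma)] \\
&= c_1E[X(Z-\gamma)] + c_2E[Y(Z-\gamma)] \\
&= c_1\operatorname{Cov}(X,Z) + c_2\operatorname{Cov}(Y,Z).
\end{aligned}$$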
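Solution (Exercise 16.18). By bilinearity,

$$\begin{aligned}
&\operatorname{Cov}(5X - Y + 2Z, X + 3Y - Z) \\
&\quad = 5\operatorname{Cov}(X, X + 3Y - Z) - \operatorname{Cov}(Y, X + 3Y - Z) + 2\operatorname{Cov}(Z, X + 3Y - Z).
\end{aligned}$$

By independence, $\operatorname{Cov}(X,Y) = 0$, $\operatorname{Cov}(X,Z) = 0$, $\operatorname{Cov}(Y,Z) = 0$. Thus

$$\operatorname{Cov}(X, X + 3Y - Z) = \operatorname{Cov}(X,X) + 0 + 0 = 1,$$

$$\operatorname{Cov}(Y, X + 3Y - Z) = 3\operatorname{Cov}(Y,Y) = 6,$$

and

$$\operatorname{Cov}(Z, X + 3Y - Z) = -\operatorname{Cov}(Z,Z) = -3.$$

Hence

$$\operatorname{Cov}(5X - Y + 2Z, X + 3Y - Z) = 5\cdot 1 - 6 + 2\cdot(-3) = -7.$$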
Solution (Exercise 16.19). We know that

$$P(X_\sigma = 1) = P(X_\tau = 1) = \frac{n}{N} = \frac{1}{2}.$$

We can think of the random selection of the n points as taking place in
two steps. In step 1, it is randomly decided whether or not $\sigma$ is selected.
After that, in step 2, points in $S - \{\sigma\}$ are randomly selected (either n − 1
points or n points, depending on whether or not $\sigma$ is selected).

To find $P(X_\tau = 1 \mid X_\sigma = 1)$, notice that since $\sigma$ is selected, the remaining
selection step consists of randomly selecting n − 1 points from $S - \{\sigma\}$. The
chance of selecting the particular point $\tau$ in this step is (n − 1)/(N − 1). It
is easy to check that for n < N we have

$$\frac{n-1}{N-1} < \frac{n}{N}.$$

Thus

$$P(X_\tau = 1 \mid X_\sigma = 1) < \frac{1}{2}.$$

Hence

$$P(X_\tau = 0 \mid X_\sigma = 1) > \frac{1}{2}.$$

A similar argument shows that

$$P(X_\tau = 1 \mid X_\sigma = 0) > \frac{1}{2}.$$

It follows by the Law of Total Probability that $P(X_\sigma \ne X_\tau) > 1/2$.
The fact that $P(X_\sigma \ne X_\tau) > 1/2$ is what we wanted to explain.
Poisson random variables


Poisson random variables are used in many applications, and have a
surprisingly elegant theory.

17.1 A limit for powers


A standard calculus fact is the following.

$$\lim_{t \to 0}(1+t)^{1/t} = e. \tag{17.1}$$

The limit in equation (17.1) describes what happens as t approaches 0 but
is not equal to zero. Since 1 + t > 0 when t is close to 0, the expression
$(1+t)^{1/t}$ in the limit makes sense.
To prove equation (17.1), we can use L’Hôpital’s Formula:

$$\lim_{t \to 0}\log\left((1+t)^{1/t}\right) = \lim_{t \to 0}\frac{1}{t}\log(1+t) = \lim_{t \to 0}\frac{\log(1+t)}{t} = \lim_{t \to 0}\frac{1/(1+t)}{1} = 1.$$

Since the exponential function is continuous, the limit in equation (17.1) is $e^1 = e$.

Here’s a handy variation on equation (17.1). We will apply this lemma 
in the next section. 


Lemma 17.1 (Exponential limit for powers). Suppose that $a_n \to \infty$,
and $b_n a_n \to z$ for some number z. (z can be any real number.) Then

$$\lim_{n \to \infty}(1 + b_n)^{a_n} = e^z. \tag{17.2}$$
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Proof. There are two cases. 


Case 1 Suppose that z 4 0. 

Then for large n, aj,b, is nonzero, so 6, is nonzero. Also, since a,b, > z, 
bn = (Gnbn)/an > 0. 

By equation (17.1), 

Gitbay he 
Then, since a,b, > z, we have 
Gnby 
(1+b,)" = ( (1+ ae = e*. 

This finishes the proof for the case that z 4 0. 


Case 2. When z = 0, it is possible that 6, = 0 holds for infinitely many 
values of n. That means that throwing around expressions like 1/b, in the 
limit is rather obnoxious. 

But we can fix that by breaking up the sequence b,, into two subsequences. 
Let b,, be the subsequence consisting of all elements b,, which are nonzero. 
If there is an infinite subsequence b,,,, we handle that just as in Case 1: 


(1+ by — @, 
by equation (17.1), and 
An, On 
(1+ bn)" = ( (1+ Png)" ) ko"k ae 


That takes care of the subsequence b,,. The rest of the sequence b,, is just a 
sequence of zeroes. For the elements b,, in that subsequence, 


(1+6,)" =1% =141=e’. 


So equation (17.2) holds for the whole sequence. 


In an appendix, a different proof of Lemma 17.1 is given, using inequalities
for the exponential function.
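
A quick numerical check can make Lemma 17.1 feel more concrete. The sketch below is only an illustration, not a proof: the sequences $a_n = n$ and $b_n = z/n + 1/n^2$, the value $z = -2.5$, and the function names are all assumptions chosen for the example. It also tries a sequence of the kind discussed in Case 2, where $b_n = 0$ for infinitely many $n$.

```python
import math


def power_term(a_n: float, b_n: float) -> float:
    """Return (1 + b_n) ** a_n, the quantity studied in Lemma 17.1."""
    return (1.0 + b_n) ** a_n


# Hypothetical example sequences: a_n = n and b_n = z/n + 1/n^2,
# so that a_n * b_n = z + 1/n converges to z.
z = -2.5
for n in (10, 1_000, 100_000):
    value = power_term(float(n), z / n + 1.0 / n**2)
    print(n, value, math.exp(z))


def b_with_zeros(n: int) -> float:
    """A sequence for Case 2 (z = 0): zero for even n, 1/n^2 for odd n."""
    return 0.0 if n % 2 == 0 else 1.0 / n**2


# Both printed values should be close to e^0 = 1.
for n in (1_000, 1_001):
    print(n, power_term(float(n), b_with_zeros(n)))
```

The first loop prints values that should approach $e^{z}$ as $n$ grows, and the second loop covers both the zero and the nonzero elements of the sequence.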


17.2 The frantic flipper and the Poisson approximation


In many experimental situations, an observer records the arrival of a message.
For example, the “message” could be a click of a Geiger counter, or a
telephone call to a sales office, or a request to a computer server. A natural 
random variable in this situation is the number N of messages that arrive 
during a given time interval. In the present chapter we will derive a formula 
for the distribution of this random variable. 

Physically, we are thinking of a situation in which there are many independent
sources which randomly send a message to the observer. Although
there are many sources, we assume that each source emits messages at a low 
rate, so the total number of arriving messages is not unbearably large. 

In the case of clicks of a Geiger counter, the clicks are caused by decay 
of atoms in a sample of radioactive material. Each atom has only a small 
chance of decaying during a given time interval, but there are many atoms 
in the sample. In the case of telephone calls to a sales office, there are 
many potential customers in the population, but for any particular potential 
customer, there is only a small chance that the customer will be motivated 
enough to call. And so on. 

In order to get a definite formula, we will consider a very familiar
situation: tossing a coin. Imagine that each source of the message is tossing 
a coin to decide whether or not to send a message. The coin has a very 
low success probability, but there are many sources, and they are all tossing 
coins. We will use this picture in our derivation, but change it slightly. 

Instead of many tossers, imagine that we have a single coin tosser, who 
tosses very rapidly, and sends a message every time the coin toss brings 
success. That seems easier to think about, and should result in the same 
type of formula. 

Suppose the coin is tossed n times, where n is large. If the coin were fair, 
the tosser would almost certainly have an enormous number of successes, and 
the number of messages would be hopelessly large. 

However, the coin has a very small success probability p. 

We are interested in finding the distribution of the random variable N 
which records the number of successes. 

Of course, everything depends on how large n is, and how small p is.
Suppose these numbers are such that np is approximately equal to a number
$\lambda$ which is of “ordinary” size. In this case we may be able to give an estimate
for the probability distribution of the successes which the tosser will obtain.

Our estimate will apply when n is large, so let’s analyze the situation by
finding a limit as $n \to \infty$.


Lemma 17.2 (The Poisson approximation to the binomial). Consider
a sequence of experiments. These experiments are not repetitions of the same
experiment. Instead, in experiment n, the tosser records the result of n tosses,
using a coin with success probability $p_n$.
Assume that
$$\lim_{n \to \infty} n p_n = \lambda, \tag{17.3}$$
for some number $\lambda$.
Let the random variable $S_n$ be the total number of heads obtained during
the experiment with n tosses. Then:

$$\lim_{n \to \infty} P(S_n = k) = \frac{\lambda^k}{k!} e^{-\lambda}. \tag{17.4}$$

As usual, $0! = 1$ in this formula. To include the special case $\lambda = 0$ in the
formula, we also use the standard convention that $0^0 = 1$.


Proof. Let us think first about the result when no heads are obtained: $k = 0$.
During experiment n, the probability of failure on a toss is $(1 - p_n)$. And
$S_n = 0$ means all n tosses in experiment n gave failure. Thus

$$\lim_{n \to \infty} P(S_n = 0) = \lim_{n \to \infty} (1 - p_n)^{n} = e^{-\lambda}, \tag{17.5}$$

using Lemma 17.1 with $a_n = n$ and $b_n = -p_n$, so that $a_n b_n = -n p_n \to -\lambda$.
Note that equation (17.5) agrees with equation (17.4).


From now on we consider $k > 0$.
Using the binomial distribution,

$$\lim_{n \to \infty} P(S_n = k) = \lim_{n \to \infty} \binom{n}{k} p_n^{k} (1 - p_n)^{n-k}. \tag{17.6}$$

Here k is a fixed positive integer, while $p_n$ is approaching zero in such a way
that $n p_n \to \lambda$.
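We'll look first at one part of the expression in the limit. Note that

$$\binom{n}{k} p_n^{k} = \frac{1}{k!}\, n(n-1)\cdots(n-k+1)\, p_n^{k} = \frac{1}{k!} \left(1 - \frac{1}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) (n p_n)^{k}.$$

Remember that k is fixed as we let $n \to \infty$. Each factor $1 - \frac{j}{n}$ converges to
one, so

$$\lim_{n \to \infty} \binom{n}{k} p_n^{k} = \frac{\lambda^k}{k!}. \tag{17.7}$$

Now let’s get back to evaluating the rest of the expression in equation (17.6),
i.e. finding $\lim_{n \to \infty} (1 - p_n)^{n-k}$. Notice that

$$\lim_{n \to \infty} p_n (n - k) = \lim_{n \to \infty} p_n n - \lim_{n \to \infty} p_n k = \lambda - 0 = \lambda.$$

By Lemma 17.1 with $b_n = -p_n$ and $a_n = n - k$ (so that $a_n b_n \to -\lambda$),
$$\lim_{n \to \infty} (1 - p_n)^{n-k} = e^{-\lambda}. \tag{17.8}$$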


Combining our facts, we have shown that equation (17.4) holds. 
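
To see the Poisson approximation at work numerically, one can compare the binomial probabilities with the limit in equation (17.4). The sketch below is a minimal illustration under stated assumptions: the choice $\lambda = 3$, the choice $p_n = \lambda/n$ (which satisfies equation (17.3) exactly), and the function names are all made up for this example.

```python
import math


def binomial_pmf(n: int, p: float, k: int) -> float:
    """P(S_n = k) for n tosses with success probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(lam: float, k: int) -> float:
    """The limit in equation (17.4): lam^k / k! * exp(-lam)."""
    return lam**k / math.factorial(k) * math.exp(-lam)


lam = 3.0  # hypothetical value of the limit of n * p_n


def max_gap(n: int, kmax: int = 10) -> float:
    """Largest difference between the two formulas for k = 0, ..., kmax."""
    p_n = lam / n  # assumed choice of p_n, so that n * p_n = lam
    return max(
        abs(binomial_pmf(n, p_n, k) - poisson_pmf(lam, k)) for k in range(kmax + 1)
    )


for n in (10, 100, 10_000):
    print(n, max_gap(n))
```

The printed gaps should shrink as n grows, in line with Lemma 17.2.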


Exercise 17.1. Prove that those probability limits in equation (17.4), for
k = 0, 1, ..., add up to 1. That is, prove that

$$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} = 1. \tag{17.9}$$

You can use the power series expansion of the exponential, namely
$$e^{x} = \sum_{k=0}^{\infty} \frac{x^k}{k!}. \tag{17.10}$$

(Notice that we use the convention $0^0 = 1$ in this power series, when evaluating
$e^{0}$.)
Exercise 17.2. Suppose that you wish to verify the power series expansion
for $e^{x}$ stated in equation (17.10).

First step: you can use a test for convergence of a series, for example the
ratio test. Use that to prove that the series

$$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \tag{17.11}$$

is convergent, for every $\lambda$.

Second step: remember from calculus that a convergent power series can
be differentiated term by term in the interior of its interval of convergence.

Let $f(\lambda)$ be the sum of the series in equation (17.11). Differentiate that series
with respect to $\lambda$, and show that $f'(\lambda) = f(\lambda)$.

Third step: Find the derivative of $f(\lambda) e^{-\lambda}$.

Fourth step: After finding $f(0)$, finish the proof of equation (17.10).


Definition 17.3 (Poisson random variables). Let $\lambda$ be a nonnegative
real number.
Let N be any random variable such that for all nonnegative integers k,

$$P(N = k) = \frac{\lambda^k}{k!} e^{-\lambda}. \tag{17.12}$$

We say that N has a Poisson distribution with parameter $\lambda$, and we also say
that N is a Poisson random variable with parameter $\lambda$.
(Equation (17.9) shows that this definition makes sense.)


The Poisson distribution is interesting in its own right, and also gives us 
a reliable approximation for the binomial distribution when equation (17.4) 
is applicable. 


Exercise 17.3 (Expected value of a Poisson random variable). Let
N be a Poisson random variable with parameter $\lambda$. Prove that

$$E[N] = \lambda. \tag{17.13}$$
Exercise 17.4 (Convergence of expectations). Let $S_n$ be the number
of successes in n tosses of a coin with success probability $p_n$. Assume that
$\lim_{n \to \infty} n p_n = \lambda$.
Let N be a Poisson random variable with parameter $\lambda$. Prove that
$$\lim_{n \to \infty} E[S_n] = E[N]. \tag{17.14}$$

Equation (17.14) complements equation (17.4), and strengthens our confidence
that the Poisson distribution is a reliable approximation to the binomial
distribution.


Example 17.4 (Bounding $e^{x}$ for $x \ge 0$). Suppose that $x \ge 0$. We will
show that

$$e^{x} \le 1 + x + \frac{1}{2} x^2 e^{x}. \tag{17.15}$$

A good approach is to use calculus. One can also use a power series.

Method 1. Let $f(x) = e^{x}$, $g(x) = 1 + x + \frac{1}{2} x^2 e^{x}$.

Then $f'(x) = e^{x}$ and $f''(x) = e^{x}$.

$g'(x) = 1 + x e^{x} + \frac{1}{2} x^2 e^{x}$, so $g''(x) = e^{x} + x e^{x} + x e^{x} + \frac{1}{2} x^2 e^{x}$.

Thus $f(0) = 1 = g(0)$, $f'(0) = 1 = g'(0)$, and $f''(x) \le g''(x)$ for all
nonnegative x. Then for any $x \ge 0$,

$$f'(x) = 1 + \int_0^x f''(t)\,dt \le 1 + \int_0^x g''(t)\,dt = g'(x).$$
Hence for any $x \ge 0$,
$$f(x) = 1 + \int_0^x f'(t)\,dt \le 1 + \int_0^x g'(t)\,dt = g(x).$$
The idea of comparing the effect of the accelerations of two cars, when
they start off with equal position and velocity, gives a good picture for this
argument.
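Method 2.

$$e^{x} = \sum_{k=0}^{\infty} \frac{x^k}{k!} = 1 + x + \sum_{k=2}^{\infty} \frac{x^k}{k!} = 1 + x + \sum_{j=0}^{\infty} \frac{x^{j+2}}{(j+2)!}.$$

Notice that $(j+2)! = (j+1)(j+2)(j!) \ge 2(j!)$. Hence

$$e^{x} \le 1 + x + \sum_{j=0}^{\infty} \frac{x^{j+2}}{2(j!)} = 1 + x + \frac{1}{2} x^2 \sum_{j=0}^{\infty} \frac{x^j}{j!} = 1 + x + \frac{1}{2} x^2 e^{x}.$$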
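
The bound in equation (17.15) is easy to test on a grid of points. The following sketch is only a sanity check under assumptions made for the example (the grid from 0 to 10 in steps of 0.01, the small relative tolerance for rounding, and the function name); it does not replace either proof. It also evaluates one negative value of x, where the bound fails, which suggests why the example restricts attention to nonnegative x.

```python
import math


def upper_bound(x: float) -> float:
    """Right-hand side of equation (17.15): 1 + x + x^2 e^x / 2."""
    return 1.0 + x + 0.5 * x * x * math.exp(x)


# Check e^x <= bound on a grid of nonnegative points (equality holds at x = 0,
# so a tiny relative tolerance guards against rounding).
grid = [i / 100 for i in range(0, 1001)]
violations = [x for x in grid if math.exp(x) > upper_bound(x) * (1 + 1e-12)]
print("violations for x >= 0:", violations)

# For a negative x the inequality can fail.
print(math.exp(-1.0), upper_bound(-1.0))
```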


Exercise 17.5 (More than one Poisson success). Let N be a Poisson
random variable with parameter $\lambda$. Prove that

$$P(N > 1) \le \frac{1}{2} \lambda^2. \tag{17.16}$$

Note that $P(N > 1) = 1 - P(N \le 1)$. One might try setting $x = \lambda$ in
equation (17.15).


If we think of N as the number of successes, equation (17.16) tells us
that when $\lambda$ is small, the chance of more than one Poisson success is small,
even in comparison to the small chance of one Poisson success.

Let’s compare Exercise 17.5 to coin tossing.


Exercise 17.6 (More than one coin toss success). Let W be the number
of successes in m independent Bernoulli trials, each trial having success
probability p. Use subadditivity to justify the following bound:

$$P(W > 1) \le \frac{m(m-1) p^2}{2}. \tag{17.17}$$

Suggestion: for any indices $i < j$, let $A_{ij}$ be the event that both trial $i$ and
trial $j$ give success. Note that $\{W > 1\}$ is equal to the union of all such
events.


In Exercise 17.6, if we assume that $mp \approx \lambda$, we see that equation (17.17)
is consistent with equation (17.16).
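
The two bounds can be compared numerically. In the sketch below, the values m = 1000 and $\lambda = 0.1$ (so p = 0.0001) are hypothetical choices for illustration, and the function names are not from the text. The exact probabilities are computed from $P(W > 1) = 1 - P(W = 0) - P(W = 1)$ and the analogous formula for N.

```python
import math


def binomial_more_than_one(m: int, p: float) -> float:
    """Exact P(W > 1) = 1 - P(W = 0) - P(W = 1) for m Bernoulli trials."""
    return 1.0 - (1.0 - p) ** m - m * p * (1.0 - p) ** (m - 1)


def poisson_more_than_one(lam: float) -> float:
    """Exact P(N > 1) = 1 - e^{-lam} (1 + lam) for a Poisson variable."""
    return 1.0 - math.exp(-lam) * (1.0 + lam)


m, lam = 1_000, 0.1   # hypothetical values, with m * p = lam
p = lam / m

print(binomial_more_than_one(m, p), m * (m - 1) * p * p / 2)  # exact value, bound (17.17)
print(poisson_more_than_one(lam), lam * lam / 2)              # exact value, bound (17.16)
```

With these values both exact probabilities should sit just below their bounds, and the two bounds should be close to each other.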
Physical settings and Poisson arrivals


We derived the Poisson distribution by thinking about the number of successes
using a low-probability coin which is tossed very rapidly during a
given time interval. So we might refer to N as (approximately) the number
of successes.

But we mentioned at the beginning of Section 17.2 that there are many
physical situations in which Poisson random variables arise. In the telephone 
call example, N is the number of telephone calls that arrive in an office during 
a fixed time interval. The office might be an office for telephone sales, or 
perhaps a help center. In the computer server example, N is the number of 
requests that arrive at a computer server which is connected to a computer 
network. 

With these settings in mind, a Poisson random variable is often referred 
to as “the number of Poisson arrivals”. 

An impressive list of other examples for the Poisson distribution is given 
in Chapter VI of Feller’s text [8]. The author mentions that the Poisson 
distribution does not just apply to random arrival times, but also applies to 
random points in the plane or in space, so that “Stars in space, raisins in
cake, weed seeds among grass seeds, flaws in materials, animal litters in fields 
are distributed in accordance with the Poisson law”. 


17.3 Poisson approximations on all time intervals


To describe the Poisson approximation for a family of time intervals, we again 
imagine a sequence of experiments. We will speak about the experiments 
as coin-tossing, but the mathematical formulas apply to any of the other 
physical situations just mentioned. 

As before, in experiment n a coin with success probability $p_n$ is tossed
many times. But we now imagine that the tosses are performed during a
time interval named $I$. And for any subinterval $J$ of that time interval, we
will keep track of the number of successes during that time interval.

Let $|J|$ denote the length of $J$. (We’ve used a different notation for length
in the past, but $|J|$ is convenient here.)

The rate of tossing is assumed to be large, and it is assumed to be constant,
so the number of tosses during a finite subinterval $J$ of $I$ is roughly
proportional to $|J|$, as long as $J$ is not too tiny.

Actually, no matter how tiny $J$ is, when n is sufficiently large the number
of tosses during $J$ will be roughly proportional to $|J|$.

Let $\ell_n(J)$ be the number of tosses of the coin during $J$. Thus $\ell_n(I)$ is
the total number of tosses during the whole time interval $I$. In our previous
discussion the total number of tosses in experiment n was equal to n.

Just as before, we assume that the probabilities $p_n$ are such that $\ell_n(I) p_n$
converges to a limit. Let’s call the limit $L$. In our earlier discussion we called
the limit $\lambda$, but now let’s use $\lambda$ to denote $L/|I|$. Thus $\lambda$ represents a rate of
success per unit time and per unit length.

By definition,

$$\lambda = \lim_{n \to \infty} \frac{\ell_n(I)\, p_n}{|I|}. \tag{17.18}$$

Since the number of tosses during any subinterval $J$ is assumed to be
proportional to the length of the interval, we also have

$$\lambda = \lim_{n \to \infty} \frac{\ell_n(J)\, p_n}{|J|}. \tag{17.19}$$

The whole analysis of Section 17.2 applies to the tosses during any time
subinterval $J$. In the earlier analysis we had $n p_n \to \lambda$. Now we have
$\ell_n(J) p_n \to \lambda |J|$, so the following definition is appropriate.


Definition 17.5. Let $N(J)$ be a Poisson random variable with parameter
$\lambda |J|$. Thus
$$P(N(J) = k) = \frac{(\lambda |J|)^k}{k!} e^{-\lambda |J|}. \tag{17.20}$$

In addition to what is stated in Definition 17.5, we have an additional
property in mind: we picture all the Poisson random variables $N(J)$ as being
defined on the same sample space. This is not something we can define with
mathematical rigor in the present book, but it is perfectly possible, and it
lets us represent things that make sense physically.

For example, $N([0,1)) + N([1,2])$ represents the number of Poisson arrivals
during the time interval $[0,2]$. $N([0,2])$ represents the same physical
random quantity, so $N([0,2]) = N([0,1)) + N([1,2])$ should hold. And it
does hold, in the right mathematical model.
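
The coin-tossing picture behind these statements can be simulated directly. The sketch below is an illustration under several assumptions that are not in the text: the tosses are placed at evenly spaced times in $[0, 2)$, the success probability is chosen so that the expected number of successes is exactly $\lambda |I|$, and the values of $\lambda$, the number of tosses, the number of trials, and the random seed are arbitrary. It records the success times and then counts them in $[0,1)$, $[1,2]$ and $[0,2]$.

```python
import random


def success_times(n_tosses: int, lam: float, length: float, rng: random.Random) -> list[float]:
    """Toss a coin at n_tosses evenly spaced times in [0, length).

    The success probability is chosen so that n_tosses * p_n = lam * length.
    Returns the times of the successful tosses.
    """
    p_n = lam * length / n_tosses
    return [length * j / n_tosses for j in range(n_tosses) if rng.random() < p_n]


def count_in(times: list[float], left: float, right: float, right_closed: bool) -> int:
    """Count the arrival times in [left, right) or in [left, right]."""
    if right_closed:
        return sum(1 for t in times if left <= t <= right)
    return sum(1 for t in times if left <= t < right)


rng = random.Random(0)          # fixed seed so that the run is repeatable
lam, n_tosses, trials = 3.0, 1_000, 500
additive = True
total_first = 0
for _ in range(trials):
    times = success_times(n_tosses, lam, 2.0, rng)
    first = count_in(times, 0.0, 1.0, right_closed=False)    # N([0,1))
    second = count_in(times, 1.0, 2.0, right_closed=True)    # N([1,2])
    whole = count_in(times, 0.0, 2.0, right_closed=True)     # N([0,2])
    additive = additive and (first + second == whole)
    total_first += first

mean_first = total_first / trials
print(additive, mean_first, lam * 1.0)
```

In every trial the counts add up exactly, because all three counts are read off from the same list of arrival times. That is the simulation's version of defining the random variables on the same sample space. The average count in $[0,1)$ should be close to $\lambda |J| = \lambda$.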
The next lemma states an important property of Poisson random variables.


Lemma 17.6 (The sum of independent Poisson random variables is
Poisson). Let $N_1, N_2$ be independent Poisson random variables with parameters
$\lambda_1, \lambda_2$ respectively. Then $N_1 + N_2$ is a Poisson random variable with
parameter $\lambda_1 + \lambda_2$.


To motivate this lemma, think about disjoint intervals $J_1, J_2$ whose union
is an interval. Physically,

$$N(J_1) + N(J_2) = N(J_1 \cup J_2), \tag{17.21}$$

since we are just adding up arrivals.

Coin tosses during different time intervals have no influence on each other,
so the random variables $N(J_1)$ and $N(J_2)$ should be independent.

Thus we expect that the statement of the lemma applies when $N_1 = N(J_1)$
and $N_2 = N(J_2)$, for any intervals $J_1, J_2$.

That makes the lemma seem plausible. The actual proof is a short
computation.


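Proof. For any k = 0, 1, ... we have

$$P(N_1 + N_2 = k) = \sum_{i=0}^{k} P(\{N_1 = i\} \cap \{N_2 = k - i\}) = \sum_{i=0}^{k} \frac{1}{k!} \binom{k}{i} \lambda_1^{i} \lambda_2^{k-i} e^{-(\lambda_1 + \lambda_2)} = \frac{1}{k!} (\lambda_1 + \lambda_2)^{k} e^{-(\lambda_1 + \lambda_2)}. \tag{17.22}$$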
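
The computation in the proof of Lemma 17.6 can be checked numerically by evaluating the sum over i directly. This is only a sketch for illustration: the parameter values 1.5 and 2.25 and the function names are assumptions made for the example, and the sum uses independence in the form $P(\{N_1 = i\} \cap \{N_2 = k - i\}) = P(N_1 = i)\, P(N_2 = k - i)$.

```python
import math


def poisson_pmf(lam: float, k: int) -> float:
    """P(N = k) for a Poisson random variable with parameter lam."""
    return lam**k / math.factorial(k) * math.exp(-lam)


def sum_pmf(lam1: float, lam2: float, k: int) -> float:
    """P(N1 + N2 = k) for independent Poisson N1, N2, summed over N1 = i."""
    return sum(poisson_pmf(lam1, i) * poisson_pmf(lam2, k - i) for i in range(k + 1))


lam1, lam2 = 1.5, 2.25   # hypothetical parameters
for k in range(6):
    print(k, sum_pmf(lam1, lam2, k), poisson_pmf(lam1 + lam2, k))
```

The two printed columns should agree up to rounding.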


Exercise 17.7. Justify the last two equalities in equation (17.22). 
Exercise 17.8. By additivity, $E[N_1 + N_2] = E[N_1] + E[N_2]$ must hold whenever
the expected values exist. Check that this is true for the random
variables $N_1, N_2$ in Lemma 17.6, using equation (17.13).


Exercise 17.9 (The variance of a Poisson random variable). Let N
be a Poisson random variable with parameter $\lambda$. Show that

$$\mathrm{Var}(N) = \lambda. \tag{17.23}$$

It may be helpful to calculate $E[N(N - 1)]$ first.


Let $N_1, N_2$ be independent Poisson random variables with parameters
$\lambda_1, \lambda_2$ respectively. By Lemma 17.6, $N_1 + N_2$ is Poisson with parameter
$\lambda_1 + \lambda_2$. Thus equation (17.23) shows that $\mathrm{Var}(N_1 + N_2) = \mathrm{Var}(N_1) + \mathrm{Var}(N_2)$,
which of course is consistent with equation (16.35).
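
The mean and variance of a Poisson random variable can also be estimated from truncated sums, which gives a way to check one's answers to Exercises 17.3 and 17.9. The sketch below is an illustration under assumptions made for the example: the parameter $\lambda = 4$, the truncation point of 200 terms, and the function name. It builds the probabilities with the recurrence $p_k = p_{k-1} \lambda / k$, which follows from equation (17.12).

```python
import math


def poisson_moments(lam: float, kmax: int = 200) -> tuple[float, float, float]:
    """Return truncated sums for the total probability, E[N] and E[N(N-1)].

    The pmf is built by the recurrence p_k = p_{k-1} * lam / k, starting from
    p_0 = exp(-lam), which avoids large factorials.
    """
    p_k = math.exp(-lam)
    total = mean = factorial_moment = 0.0
    for k in range(kmax + 1):
        if k > 0:
            p_k *= lam / k
        total += p_k
        mean += k * p_k
        factorial_moment += k * (k - 1) * p_k
    return total, mean, factorial_moment


lam = 4.0  # hypothetical parameter
total, mean, factorial_moment = poisson_moments(lam)
variance = factorial_moment + mean - mean**2   # Var(N) = E[N(N-1)] + E[N] - (E[N])^2
print(total, mean, variance)
```

The total should be close to 1, and both the mean and the variance should be close to $\lambda$.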


17.4 Waiting for a Poisson arrival 


Back to the help center! 

Suppose you are sitting at a desk in a help center, stoically waiting by
your telephone for the first call of the day to arrive. Let $\tau$ be the random
variable which represents how long you must wait. In the case of clicks of a
Geiger counter, a similar random variable represents the time until the first
click. We would like to know the distribution of $\tau$.

For example, given a particular time t, what is $P(\tau > t)$? We can answer
this question surprisingly easily, by thinking about $N([0,t])$. The event that
$\tau > t$ is exactly the event that $N([0,t]) = 0$. Since $N([0,t])$ is a Poisson
random variable with parameter $\lambda t$,

$$P(\tau > t) = P(N([0,t]) = 0) = e^{-\lambda t}. \tag{17.24}$$

The tail of a distribution characterizes the distribution, so equation (17.24)
tells us that $\tau$ has an exponential distribution (recall equation (15.8)).
Now let’s compare equation (17.24) with coin-tossing.

Consider a sequence of experiments. In experiment n a coin with success
probability $p_n$ is tossed again and again during a long time interval $I$.

As in Section 17.3, we assume that there are $\ell_n(J)$ tosses during a time
interval $J$. For each $J$, for large n we have by equation (17.19) that

$$\lambda \approx \frac{\ell_n(J)\, p_n}{|J|},$$

and so

$$|J| \approx \frac{\ell_n(J)\, p_n}{\lambda}.$$

Thus we convert from tosses to times by the following approximate formula:

$$\text{time} \approx (\text{number of tosses}) \cdot \frac{p_n}{\lambda}. \tag{17.25}$$

Let $W(n)$ denote the number of tosses required to obtain the first success.
That is, $W(n)$ is the smallest positive integer k such that toss k produces a
success. Then the distribution of $W(n)$ is the geometric distribution
(Definition 13.3). Thus the distribution of $W(n)$ is given by the formula
in equation (13.12):

$$P(W(n) > k) = (1 - p_n)^{k}.$$

Let $T(n)$ be the time until the first success. Then

$$P(T(n) > t) \approx P\left( W(n) \frac{p_n}{\lambda} > t \right) = P\left( W(n) > \frac{\lambda t}{p_n} \right) \approx (1 - p_n)^{\lambda t / p_n}.$$

By Lemma 17.1 (with $a_n = \lambda t / p_n$ and $b_n = -p_n$),
$$P(T(n) > t) \approx e^{-\lambda t}. \tag{17.26}$$

Since $P(\tau > t) = e^{-\lambda t}$, this is consistent with the Poisson approximation.
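
The comparison between the geometric tail and the exponential tail can be tried out with a few numbers. The sketch below is an illustration only: the values $\lambda = 2$ and $t = 0.75$ are hypothetical, the function name is made up, and rounding $\lambda t / p_n$ down to a whole number of tosses is an assumption of the sketch (the formula $P(W(n) > k) = (1 - p_n)^k$ needs an integer k).

```python
import math


def geometric_tail(p_n: float, lam: float, t: float) -> float:
    """Approximate P(T(n) > t) by P(W(n) > k) = (1 - p_n)**k.

    Here k = floor(lam * t / p_n) is the number of tosses that fit into time t
    under the conversion (17.25); rounding down is an assumption of this sketch.
    """
    k = math.floor(lam * t / p_n)
    return (1.0 - p_n) ** k


lam, t = 2.0, 0.75   # hypothetical rate and time
for p_n in (0.1, 0.01, 0.0001):
    print(p_n, geometric_tail(p_n, lam, t), math.exp(-lam * t))
```

As $p_n$ gets smaller, the first printed column should move toward $e^{-\lambda t}$.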


17.5 Solutions for Chapter 17


Solution (Exercise 17.1). Just set $x = \lambda$ in equation (17.10). Then
multiply both sides by $e^{-\lambda}$.
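Solution (Exercise 17.2).
Step 1. Ye Olde Ratio Test:

$$\lim_{k \to \infty} \frac{\left| \lambda^{k+1} / (k+1)! \right|}{\left| \lambda^{k} / k! \right|} = \lim_{k \to \infty} \frac{|\lambda|}{k+1} = 0 < 1.$$

So the series converges.

Step 2.
$$f'(\lambda) = \frac{d}{d\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \sum_{k=1}^{\infty} \frac{d}{d\lambda} \frac{\lambda^k}{k!} = \sum_{k=1}^{\infty} \frac{k \lambda^{k-1}}{k!} = \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = f(\lambda).$$

Step 3.
$$\left( f(\lambda) e^{-\lambda} \right)' = f'(\lambda) e^{-\lambda} - f(\lambda) e^{-\lambda} = 0.$$
By the Mean Value Theorem of Calculus, the expression $f(\lambda) e^{-\lambda}$ is constant,
so $f(\lambda) e^{-\lambda} = f(0)$.

Step 4. $f(0) = 1$.
Since $f(\lambda) e^{-\lambda} = 1$, $f(\lambda) = e^{\lambda}$.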


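Solution (Exercise 17.3).
$$E[N] = \sum_{k=0}^{\infty} k \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=1}^{\infty} k \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} e^{-\lambda} = \lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} e^{-\lambda} = \lambda \left( \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} \right) e^{-\lambda} = \lambda.$$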


Solution (Exercise 17.4). Let $S_n$ be the number of heads obtained in
experiment n. Then
$$E[S_n] = n p_n \to \lambda = E[N].$$


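Solution (Exercise 17.5). By equation (17.12),

$$P(N \le 1) = P(N = 0) + P(N = 1) = e^{-\lambda} + \lambda e^{-\lambda} = e^{-\lambda}(1 + \lambda). \tag{17.27}$$

By equation (17.15) in Example 17.4,
$$e^{\lambda} \le 1 + \lambda + \frac{1}{2} \lambda^2 e^{\lambda}.$$
Thus
$$1 + \lambda \ge e^{\lambda} - \frac{1}{2} \lambda^2 e^{\lambda} = \left( 1 - \frac{1}{2} \lambda^2 \right) e^{\lambda}.$$

Applying this inequality to equation (17.27),
$$P(N \le 1) = e^{-\lambda}(1 + \lambda) \ge 1 - \frac{1}{2} \lambda^2.$$

Thus
$$P(N > 1) = 1 - P(N \le 1) \le \frac{1}{2} \lambda^2.$$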


Solution (Exercise 17.6). To say that $\{W > 1\}$ occurs is the same as
saying that there are at least two tosses, say toss $i$ and toss $j$, with $i < j$,
such that both of those tosses give success. Thus $\{W > 1\}$ occurs when at
least one of the events $A_{ij}$ occurs. That is why

$$\{W > 1\} = \bigcup_{i < j} A_{ij}.$$

By sub-additivity (Theorem 2.26),

$$P(W > 1) \le \sum_{i < j} P(A_{ij}).$$

For each $i \neq j$, $P(A_{ij}) = p^2$.
The number of pairs $i, j$ such that $i < j$ is exactly the same as the
number of subsets of $\{1, \ldots, m\}$ containing two elements. Thus there are
$C^m_2 = m(m-1)/2$ such subsets, by Lemma 8.1.
Hence
$$P(W > 1) \le \frac{m(m-1)}{2} p^2,$$

verifying equation (17.17).
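Solution (Exercise 17.7). By independence and equation (17.12), each term
$P(\{N_1 = i\} \cap \{N_2 = k - i\})$ is the product of the two Poisson probabilities
below. The second last equality then holds by the definition of $\binom{k}{i}$:

$$\frac{\lambda_1^{i}}{i!} e^{-\lambda_1} \cdot \frac{\lambda_2^{k-i}}{(k-i)!} e^{-\lambda_2} = \frac{1}{i!\,(k-i)!} \lambda_1^{i} \lambda_2^{k-i} e^{-(\lambda_1 + \lambda_2)} = \frac{1}{k!} \cdot \frac{k!}{i!\,(k-i)!} \lambda_1^{i} \lambda_2^{k-i} e^{-(\lambda_1 + \lambda_2)} = \frac{1}{k!} \binom{k}{i} \lambda_1^{i} \lambda_2^{k-i} e^{-(\lambda_1 + \lambda_2)}.$$

The final equality holds by the binomial theorem (equation (8.5)):
$$\sum_{i=0}^{k} \binom{k}{i} \lambda_1^{i} \lambda_2^{k-i} = (\lambda_1 + \lambda_2)^{k}.$$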


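Solution (Exercise 17.8). By equation (17.13), $E[N_1] = \lambda_1$, $E[N_2] = \lambda_2$,
and, since $N_1 + N_2$ is Poisson with parameter $\lambda_1 + \lambda_2$, $E[N_1 + N_2] = \lambda_1 + \lambda_2$.


Solution (Exercise 17.9). By Theorem 14.11,
$$E[N(N-1)] = \sum_{k=0}^{\infty} k(k-1) \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=2}^{\infty} k(k-1) \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=2}^{\infty} \frac{\lambda^k}{(k-2)!} e^{-\lambda} = \lambda^2 \sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!} e^{-\lambda} = \lambda^2 \left( \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} \right) e^{-\lambda} = \lambda^2.$$

Thus $E[N^2 - N] = \lambda^2$, and so $E[N^2] = \lambda^2 + E[N] = \lambda^2 + \lambda$.
Hence $\mathrm{Var}(N) = E[N^2] - (E[N])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda$.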
Chapter 18


Normal random variables and 
the Central Limit Theorem 


18.1 Sums of independent random variables 


Let $S_n$ be the number of heads obtained in n tosses of a coin, where the
probability of a head on any toss is p. Then $S_n$ has a binomial distribution
with parameters n, p (see Example 9.8).

We can calculate $P(S_n = k)$ using equation (9.3), but this formula
doesn’t seem to give us a sense of the main features of the distribution,
especially when n is large. We do obtain some insight by calculating $E[S_n]$
and $\mathrm{Var}(S_n)$, but that is only a start. In the present chapter we will go
much farther in understanding the distributions of random variables which
are similar to $S_n$.

In the experiment of tossing a coin repeatedly, let $X_i = 1$ if toss $i$ results
in a head, and let $X_i = 0$ otherwise. Then $S_n = X_1 + \ldots + X_n$. The fact
that $S_n$ can be written as a sum of independent random variables is the key
to understanding its distribution.

We saw this already in the calculation of the mean and variance of $S_n$.
Writing $S_n$ as a sum allowed us to use additivity of expectation, showing
that $E[S_n] = E[X_1] + \ldots + E[X_n]$. Similarly, the additivity of variance for
independent random variables shows that
$\mathrm{Var}(S_n) = \mathrm{Var}(X_1) + \ldots + \mathrm{Var}(X_n)$.

Notice that the mean and variance of $S_n$ are completely determined by the
means and variances of the $X_i$, and do not depend on any other properties
of the distributions of the random variables $X_i$.
One can say much more about the distribution of $S_n$. In a very wide range
of cases, when $S_n$ is the sum of independent random variables $X_1, \ldots, X_n$,
and n is large, the distribution of $S_n$ is approximately described by a simple
analytical formula, and the formula does not depend on the details of the
distributions of the random variables $X_i$. This is the content of the Central
Limit Theorem (see Theorem 18.16 and Example 18.19).

To express the Central Limit Theorem, we will introduce a new probability
distribution, called the normal distribution. This distribution has a
smooth probability density, whose graph has a characteristic shape, sometimes
called a “bell-shaped curve” (see Figure 18.7).

The Central Limit Theorem is a mathematical result, but it is consistent 
with experiment. Many physical random variables have distributions which 
are approximately normal. Because the normal distribution plays such an 
important role, we will develop its properties carefully. 


18.2 Plotting the binomial distribution 


The binomial distribution is a very special case, but we can motivate the 
Central Limit Theorem by plotting the values of this distribution. 

In the present section we will take p = 1/2, so $S_n$ represents the number
of heads in n tosses of a fair coin.

In Figures 18.1, 18.2 and 18.3 we take a straightforward approach and
graph all values for the probability mass function of $S_n$ when n = 100,
n = 1000 and n = 10000.

There are n + 1 points in the range of $S_n$, namely 0, 1, ..., n. In these
figures we are plotting the points $(k, P(S_n = k))$, for all k in the range of
$S_n$.

When the points are close together, they may appear to form a continuous
curve, but that is just a limitation of the picture.

The three graphs, for n = 100, n = 1000, n = 10000 don’t look very
similar, although in each case there is a single maximum at the center of the
graph, which occurs at the mean value of $S_n$.

Notice that in each case the only significant probability values are found
relatively near the mean value of $S_n$. It is interesting that the interval in which
the probability is concentrated is so small, relative to the whole range of $S_n$.
This is especially true as n gets large.
Figure 18.1: $P(S_n = k)$ versus k for n = 100.

Figure 18.2: $P(S_n = k)$ versus k for n = 1000.

Figure 18.3: $P(S_n = k)$ versus k for n = 10000.


Because the probability is so concentrated, it is hard to see details in the 
graphs when n is large. We have to do something about that problem. 


For convenience, let us use the name $I_n$ to refer to the interval where
significant probability values are found in the binomial distribution. We
might refer to that region verbally as the part of the graph where the “main
values” of the distribution are located.

$I_n$ is not a precisely defined interval but we can see roughly where it is,
by looking at each graph. Let $|I_n|$ denote the number of points in $I_n$. As n
increases, $|I_n|$ increases along with n, but it evidently increases more slowly
than n.

The smallness of $|I_n|$ relative to the size of the range of $S_n$ makes the graph
of the probability mass function appear more and more sharply peaked, as
n increases.

On the other hand, since the number of points in $I_n$ is growing larger, we
are not surprised that the maximum value for $P(S_n = k)$ becomes smaller,
since the total probability is shared among more points.
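
The shrinking of the maximum value can be checked by computing it directly. The sketch below is a small illustration: the function name is an assumption, and it relies on the fact that for a fair coin $P(S_n = k)$ is the binomial coefficient divided by $2^n$. It also prints the peak multiplied by $\sqrt{n}$, as one hedged way of looking at how fast the peak shrinks.

```python
import math


def fair_binomial_pmf(n: int, k: int) -> float:
    """P(S_n = k) for a fair coin; the division of exact integers stays accurate."""
    return math.comb(n, k) / 2**n


for n in (100, 1_000, 10_000):
    peak = fair_binomial_pmf(n, n // 2)   # value at the mean n/2
    print(n, peak, peak * math.sqrt(n))
```

The peak values decrease as n grows, while the last column changes very little. That suggests the peak shrinks roughly like $1/\sqrt{n}$, which would be consistent with $|I_n|$ growing more slowly than n.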

It is hard to see the precise shapes of the graphs in Figures 18.1, 18.2
and 18.3, since the interesting part of the graph is being squeezed into a
narrower and narrower spike as n gets large. In order to see more clearly
what is going on, we need to focus our attention on the central region $I_n$.
Figure 18.4: Main values of $P(S_n = k)$ for n = 100.

Figure 18.5: Main values of $P(S_n = k)$ for n = 1000.

Figure 18.6: Main values of $P(S_n = k)$ for n = 10000.


To focus on $I_n$, let’s choose a precise definition of $I_n$. We'll define a little
bit of terminology, just for this discussion.


Terminology 18.1 (The “main part” of the distribution). Suppose
that we consider a small probability value $\delta$, say $\delta = .001$.

For each n, let $\ell(n)$ be such that $P(S_n < \ell(n))$ is approximately equal to
$\delta$. Since $\delta$ is small, the values of $S_n$ are unlikely to be further to the left than
$\ell(n)$.

Similarly, let $u(n)$ be such that $P(S_n > u(n))$ is approximately equal to
$\delta$. So the values of $S_n$ are unlikely to be further to the right than $u(n)$.

In our discussion we will refer to the part of the range of $S_n$ which lies
within $[\ell(n), u(n)]$ as the “main part” of the range.

From the definitions, $[\ell(n), u(n)]$ is an interval that contains $1 - 2\delta$ of
the probability for this distribution. If $\delta = .001$ then this interval contains
99.8% of the probability for this distribution. So the term “main part” seems
justified.

The values of $\ell(n)$ and $u(n)$ will of course change if we choose some other
value for the small probability $\delta$. In our discussion we will stick to using
$\delta = .001$.


Using a computer, we find that $[\ell(n), u(n)]$ is $[35, 65]$ when n = 100,
$[451, 549]$ when n = 1000, and $[4845, 5155]$ when n = 10000. The number of
points in $[\ell(n), u(n)]$ is 31 when n = 100, 99 when n = 1000, and 311 when
n = 10000.
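
Here is one possible way to carry out such a computation. Terminology 18.1 only asks for probabilities "approximately equal" to the small value, so the sketch has to pick a precise rule, and that rule is an assumption: it takes the lower endpoint to be the largest integer l with $P(S_n < l)$ at most .001, and uses the symmetry of the fair-coin distribution for the upper endpoint. The function name and the use of exact integer arithmetic are also choices made for this example. With a different rule, the endpoints could differ by one.

```python
import math


def main_part(n: int, delta_num: int = 1, delta_den: int = 1000) -> tuple[int, int]:
    """Return (l, u) for the number of heads S_n in n tosses of a fair coin.

    l is the largest integer with P(S_n < l) <= delta_num / delta_den, and
    u = n - l is the matching upper endpoint by symmetry. All comparisons use
    exact integers: P(S_n < l) = (number of outcomes with fewer than l heads) / 2^n.
    """
    total = 2**n
    cumulative = 0   # number of outcomes with S_n < l
    l = 0
    # Move l to the right while P(S_n < l + 1) is still at most delta.
    while (cumulative + math.comb(n, l)) * delta_den <= delta_num * total:
        cumulative += math.comb(n, l)
        l += 1
    return l, n - l


for n in (100, 1_000):
    l, u = main_part(n)
    print(n, (l, u), u - l + 1)
```

For n = 100 this rule gives the interval [35, 65] reported in the text.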

Figures 18.4, 18.5 and 18.6 show the graphs of the probability mass
functions only for k in $[\ell(n), u(n)]$ and n = 100, n = 1000 and n = 10000.
So we are zooming in on the interval $[\ell(n), u(n)]$. This is where the action
is, for these distributions.

Restricting our attention to k in $[\ell(n), u(n)]$ has made it possible to
see the shape of the graph of the probability mass function much more
clearly. It is striking how similar in shape these three graphs are now, for
n = 100, 1000, 10000.

Figure 18.6 looks like the graph of a continuous curve, but of course it is
still obtained by plotting a finite number of points. In the next section we will
introduce a continuous function which has the shape shown in Figure 18.6.


18.3 A function with the right shape 


We will derive some properties of the function $e^{-x^2}$ whose graph is shown in
Figure 18.7. The shape is similar to Figure 18.6. The Central Limit Theorem
will show that this resemblance is not an accident.

Lemma 18.2 (An important integral). The function $e^{-x^2}$ is integrable
on $\mathbb{R}$, and

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}. \tag{18.1}$$


Proof. The usual way of finding $\int_{-\infty}^{\infty} e^{-x^2}\,dx$ would be to evaluate

$$\lim_{a \to \infty} \int_{-a}^{a} e^{-x^2}\,dx.$$

Unfortunately, we can’t even start this procedure. The function $e^{-x^2}$ does
not have an antiderivative in terms of the calculus functions that we know
and love. So we cannot calculate $\int_{-a}^{a} e^{-x^2}\,dx$ directly.
Figure 18.7: The graph of $f(x) = e^{-x^2}$, a “bell-shaped curve”.


To even show that $\int_{-\infty}^{\infty} f(x)\,dx$ exists we will have to use a comparison
principle. If we can find a nonnegative function g such that $\int_{-\infty}^{\infty} g(x)\,dx$ exists,
and if $e^{-x^2} \le g(x)$ for all x, then $\int_{-\infty}^{\infty} e^{-x^2}\,dx$ exists.

In the present situation, the reader can easily check that for any x,

$$x^2 \ge |x| - 1.$$

(Consider the case $|x| \le 1$ and the case $|x| > 1$ separately.)
Since $-(|x| - 1) \ge -x^2$, and since the exponential function is an increasing
function,
$$e^{1 - |x|} \ge e^{-x^2}$$

holds for all x. And we can certainly use ordinary calculus methods to
show that $\int_{-\infty}^{\infty} e^{1 - |x|}\,dx$ exists. (Using symmetry, this integral is the same as
$2 \int_0^{\infty} e^{1 - x}\,dx = 2e \int_0^{\infty} e^{-x}\,dx$, etc.)

Hence, by the comparison principle for integrals, we know that $\int_{-\infty}^{\infty} e^{-x^2}\,dx$
exists.

But we want to know the value of this integral, not just that it exists. To 
evaluate the integral, we use a trick! 
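First, we move the problem to $\mathbb{R}^2$, by noticing the following convenient
fact.

$$\iint_{\mathbb{R}^2} e^{-(s^2 + t^2)}\,ds\,dt = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-s^2} e^{-t^2}\,ds\,dt = \left( \int_{-\infty}^{\infty} e^{-s^2}\,ds \right) \left( \int_{-\infty}^{\infty} e^{-t^2}\,dt \right) = \left( \int_{-\infty}^{\infty} e^{-x^2}\,dx \right)^2.$$

We can use polar coordinates to evaluate the integral over $\mathbb{R}^2$.

$$\iint_{\mathbb{R}^2} e^{-(s^2 + t^2)}\,ds\,dt = \int_0^{2\pi} \int_0^{\infty} e^{-r^2}\, r\,dr\,d\theta = 2\pi \int_0^{\infty} e^{-r^2}\, r\,dr.$$

Now we have an integral for which calculus methods work.

$$\int_0^{\infty} e^{-r^2}\, 2r\,dr = \lim_{b \to \infty} \int_0^{b} e^{-r^2}\, 2r\,dr = \lim_{b \to \infty} \left[ -e^{-r^2} \right]_0^{b} = \lim_{b \to \infty} \left( 1 - e^{-b^2} \right) = 1. \tag{18.2}$$

Thus
$$\left( \int_{-\infty}^{\infty} e^{-x^2}\,dx \right)^2 = 2\pi \int_0^{\infty} e^{-r^2}\, r\,dr = \pi \int_0^{\infty} e^{-r^2}\, 2r\,dr = \pi,$$
and since the integral is positive, this gives equation (18.1).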
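
The value $\sqrt{\pi}$ in equation (18.1) can be confirmed numerically, even though the integrand has no elementary antiderivative. The sketch below uses the midpoint rule, which is a swapped-in numerical technique and not the method of the proof. The cutoff at $\pm 10$, the number of steps, and the function name are assumptions made for the example. Cutting off at $\pm 10$ is harmless because $e^{-x^2}$ is smaller than $e^{-100}$ outside that range.

```python
import math


def midpoint_integral(f, a: float, b: float, steps: int) -> float:
    """Approximate the integral of f over [a, b] by the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


approx = midpoint_integral(lambda x: math.exp(-x * x), -10.0, 10.0, 20_000)
print(approx, math.sqrt(math.pi))

# The one-dimensional integral in equation (18.2), cut off at r = 10.
radial = midpoint_integral(lambda r: math.exp(-r * r) * 2 * r, 0.0, 10.0, 20_000)
print(radial)
```

The first line should print two nearly equal numbers, and the second value should be close to 1.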


Exercise 18.1 (A random variable whose distribution density has
the right shape). Equation (18.1) tells us that

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}} e^{-y^2}\,dy = 1,$$

so $\frac{1}{\sqrt{\pi}} e^{-y^2}$ is a probability density.

Let Y be a random variable whose distribution has probability density
$\frac{1}{\sqrt{\pi}} e^{-y^2}$.

(i) By equation (15.5),

$$E[Y^2] = \int_{-\infty}^{\infty} y^2 \frac{1}{\sqrt{\pi}} e^{-y^2}\,dy, \tag{18.3}$$

provided that the integral on the right exists.
Show that $E[Y^2]$ exists and
$$E[Y^2] = 1/2. \tag{18.4}$$
(Integration by parts is useful here.)

(ii) Show that
$$E[Y] = 0. \tag{18.5}$$

(iii) Show that
$$E[|Y|] = \frac{1}{\sqrt{\pi}}. \tag{18.6}$$


The density $\frac{1}{\sqrt{\pi}} e^{-y^2}$ used in Exercise 18.1 is an example of a normal
density.


Definition 18.3 (Normal densities and distributions). Let $\kappa, m$ be real
numbers with $\kappa \neq 0$. Let g be the probability density given by

$$g(x) = \frac{1}{|\kappa| \sqrt{\pi}}\, e^{-\frac{(x - m)^2}{\kappa^2}}. \tag{18.7}$$


Then g is said to be a normal probability density. 

Any distribution with a normal density is a normal distribution, and any 
random variable with a normal distribution is said to be a normal random 
variable. 


Incidentally, the word “normal” in this definition is a special usage. One 
should not draw the conclusion that there is something wrong with a random 
variable if it is not a normal random variable. 

Normal densities are also referred to as Gaussian densities (in honor of 
the mathematician Carl Friedrich Gauss). 

Let Y be a random variable like the one in Exercise 18.1, meaning that
the distribution of Y has probability density $\frac{1}{\sqrt{\pi}} e^{-y^2}$. Exercise 18.2 will show
that a probability density g for the distribution of $\kappa Y + m$ is given by equation
(18.7).

The most common way of writing a normal density is actually the one 
given below in equation (18.20), which differs slightly from equation (18.7). 
It is also useful to be able to recognize normal densities which are written in 
other forms (see Lemma 18.8). 


18.4 Simple transformations of distributions 


Before discussing properties of normal distributions, we need to discuss
transformations of distributions. This will give some motivation for the definition,
and is relevant for another reason. In applications we often have to transform
one normal distribution into another.

In Exercise 18.1 of Section 18.3 we introduced a random variable Y whose
distribution has a probability density $\frac{1}{\sqrt{\pi}} e^{-y^2}$. We said that the shape of the
graph of this density resembles the shape of the distribution of a binomial
random variable $S_n$.

Of course, we haven’t defined exactly what is meant by “shape”. But in 
this section we note some transformations which preserve the kind of shape 
we are interested in. 

You’ve likely seen such transformations when sketching graphs of func- 
tions in calculus. For example, after you’ve seen the graph of y = 2”, you 
can easily sketch the graph of y = 57x, without plotting any points. 

Just to have some terminology, we will speak of shifting a graph (moving 
the graph horizontally or vertically), scaling a graph (stretching the graph 
horizontally or vertically), and reflecting a graph (vertically or horizontally). 

These transformations are carried out in the following ways. 


• The graph of a function $f$ is moved to the right by $m$ to obtain the
graph of $f(x - m)$. (If $m$ is negative this actually means that the graph
is moved to the left.) We refer to this movement as shifting the graph.
(The graph could be moved up by $b$ to obtain the graph of $f(x) + b$, but
we won’t have occasion to do this for graphs of probability densities.)

• The graph of a function $f$ is stretched horizontally by $\kappa > 0$ to obtain
the graph of $f(x/\kappa)$. And when $\kappa < 0$ the graph of $f$ is stretched by
$|\kappa|$ and reflected horizontally to obtain the graph of $f(x/\kappa)$.

• The graph of a function $f$ is stretched vertically by $\kappa > 0$ to obtain the
graph of $\kappa f(x)$. And of course when $\kappa < 0$ the graph of $f$ is stretched
vertically by $|\kappa|$ and reflected vertically to obtain the graph of $\kappa f(x)$.


Remark 18.4 (Stretching factors). The graph of $f(x/100)$ is 100 times
wider than the graph of $f(x)$. Can you see why?

The reason is that everything happens 100 times more slowly when the
function is $f(x/100)$. Consider moving from $x$ to $x + \Delta$. This causes a change
$f(x + \Delta) - f(x)$ in the value of $f$. When the function is $f(x/100)$, you will
need a change of $100\Delta$ in $x$ to obtain the same change in the value of the
function.


When $f$ is a probability density, the function $(1/|\kappa|) f((x - m)/\kappa)$ is
again a probability density, as you will show in Exercise 18.2.
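As a quick numerical sanity check of this claim (not a proof, and not a substitute for Exercise 18.2), the following Python sketch integrates the transformed function with the midpoint rule. The choice of $f$ (the density from Exercise 18.1), the values $\kappa = -3$ and $m = 2$, and all helper names are assumptions made only for illustration.

```python
import math

def f(y):
    """Density from Exercise 18.1 (assumed example): exp(-y^2) / sqrt(pi)."""
    return math.exp(-y * y) / math.sqrt(math.pi)

def transformed(density, kappa, m):
    """Return the function x -> (1/|kappa|) * density((x - m) / kappa)."""
    if kappa == 0:
        raise ValueError("kappa must be nonzero")
    return lambda x: density((x - m) / kappa) / abs(kappa)

def midpoint_integral(func, lo, hi, steps=200_000):
    """Approximate the integral of func over [lo, hi] by the midpoint rule."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (i + 0.5) * width) for i in range(steps))

# Illustrative parameters: a negative kappa also reflects the graph.
h = transformed(f, kappa=-3.0, m=2.0)

# The transformed density is negligible outside [m - 40, m + 40].
total = midpoint_integral(h, 2.0 - 40.0, 2.0 + 40.0)
print(total)  # should be very close to 1
```

Trying other nonzero values of `kappa` shows why the factor $1/|\kappa|$ is needed: without it the area under the stretched graph would be $|\kappa|$ rather than 1.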


Exercise 18.2 (Scaling and shifting probability densities). Let $X$ be
a random variable whose distribution has a probability density $f$. Using
Definition 3.4 and Remark 9.11, check the following.

(i) Let $m$ be any real number. Let $W = X + m$.
Show that $f(w - m)$ is a probability density for the distribution of $W$.

(ii) Let $\kappa$ be any nonzero real number. Let $V = \kappa X$.
Show that $(1/|\kappa|) f(v/\kappa)$ is a probability density for the distribution of
$V$.

(iii) Let $\kappa, m$ be any real numbers with $\kappa \neq 0$. Let $U = \kappa X + m$.
Show that $(1/|\kappa|) f((u - m)/\kappa)$ is a probability density for the
distribution of $U$.


When $f$ is a probability density, we can sometimes understand the
properties of $f$ more clearly by considering a transformed version of $f$, such as
$(1/|\kappa|) f((x - m)/\kappa)$, for suitable values of $m$ and $\kappa$.

Lemma 18.5 (Variance and mean for a scaled and shifted density).
Let $f$ be the density of the distribution of a random variable $Y$. Suppose
that $E[Y]$ and $\mathrm{Var}(Y)$ exist.

Let $\kappa, m$ be numbers with $\kappa \neq 0$. Let $h$ be the function defined by

$$h(x) = \frac{1}{|\kappa|}\, f\!\left(\frac{1}{\kappa}(x - m)\right). \tag{18.8}$$

Let $W$ be a random variable whose distribution has density $h$.

Then $W$ and $\kappa Y + m$ have the same distribution, and

$$\mathrm{Var}(W) = \kappa^2\, \mathrm{Var}(Y), \qquad E[W] = \kappa E[Y] + m. \tag{18.9}$$

Proof. Exercise 18.2 tells us that $h$ is a density for the distribution of $\kappa Y + m$.
Hence $\mathrm{Var}(W) = \mathrm{Var}(\kappa Y + m) = \kappa^2\, \mathrm{Var}(Y)$ and $E[W] = E[\kappa Y + m] =
\kappa E[Y] + m$.
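The lemma holds for any density with a mean and variance, not only for normal densities. The Python sketch below checks equation (18.9) numerically for an exponential density $f(y) = e^{-y}$ on $y \ge 0$. That density, the values $\kappa = -3$ and $m = 2$, and the helper names are illustrative assumptions, and the midpoint rule is only one convenient way to approximate the integrals.

```python
import math

def f(y):
    """Illustrative density: exponential with mean 1 and variance 1."""
    return math.exp(-y) if y >= 0 else 0.0

kappa, m = -3.0, 2.0          # illustrative values with kappa != 0

def h(x):
    """Scaled and shifted density, as in equation (18.8)."""
    return f((x - m) / kappa) / abs(kappa)

def mean_and_variance(density, lo, hi, steps=200_000):
    """Midpoint-rule mean and variance of a density negligible outside [lo, hi]."""
    width = (hi - lo) / steps
    xs = [lo + (i + 0.5) * width for i in range(steps)]
    mean = width * sum(x * density(x) for x in xs)
    second_moment = width * sum(x * x * density(x) for x in xs)
    return mean, second_moment - mean * mean

mean_y, var_y = mean_and_variance(f, 0.0, 60.0)
# Because kappa < 0, the density h lives on x <= m.
mean_w, var_w = mean_and_variance(h, m - 180.0, m)

print(mean_w, kappa * mean_y + m)     # both close to -1
print(var_w, kappa ** 2 * var_y)      # both close to 9
```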


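Corollary 18.6 (Variance and mean for a normal density). Let $Y$
be the random variable defined in Exercise 18.1, so that a density for the
distribution of $Y$ is the function $f$ defined by

$$f(y) = \frac{1}{\sqrt{\pi}}\, e^{-y^2}.$$

Let $W$ be a random variable whose distribution has density $h$, where $h$ is
defined by

$$h(x) = \frac{1}{|\kappa|} \frac{1}{\sqrt{\pi}}\, e^{-\left(\frac{x-m}{\kappa}\right)^2} = \frac{1}{\sqrt{\kappa^2 \pi}}\, e^{-\frac{(x-m)^2}{\kappa^2}}. \tag{18.10}$$

Then $W$ and $\kappa Y + m$ have the same distribution, and

$$\mathrm{Var}(W) = \kappa^2\, \mathrm{Var}(Y) = \frac{\kappa^2}{2}, \qquad E[W] = \kappa E[Y] + m = m. \tag{18.11}$$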


Proof. We showed in Exercise 18.1 that $E[Y^2] = 1/2$ and $E[Y] = 0$.
Equation (18.9) then gives equation (18.11).

Remark 18.7 (Any scaled and shifted normal is normal!). The
argument in Corollary 18.6 shows that a random variable $X$ is a normal random
variable if and only if for some real numbers $\kappa, m$ with $\kappa \neq 0$, we have
$X = \kappa Y + m$, where $Y$ is the random variable defined in Exercise 18.1.
When $X = \kappa Y + m$, suppose we now form a new random variable $W =
\tau X + \nu$, where $\tau, \nu$ are real numbers and $\tau \neq 0$.
Note that

$$W = \tau(\kappa Y + m) + \nu = \tau\kappa Y + (\tau m + \nu).$$

Hence $W$ is also normal.


18.5 Properties of normal densities 


Because any normal density involves a function similar to $e^{-x^2}$, and such
functions do not have elementary antiderivatives, one might suspect that
computations with normal densities must be hard. But the calculations performed
in this section are easy, even though dealing with $e^{-x^2}$ is inherently messier
than dealing with $e^{-x}$.


Lemma 18.8 (Equivalent forms of normal densities). The following
statements are equivalent.

(i) $g$ is a normal density, i.e. for some $\kappa, m$ with $\kappa \neq 0$,

$$g(x) = \frac{1}{\sqrt{\kappa^2 \pi}}\, e^{-\frac{(x-m)^2}{\kappa^2}}. \tag{18.12}$$

(ii) $g$ is a probability density such that

$$g(x) = d\, e^{-(k_2 x^2 + k_1 x + k_0)}, \tag{18.13}$$

where $k_0, k_1, k_2, d$ are constants and $k_2 > 0$.

Equations (18.12) and (18.13) hold for the same probability density $g$ when:

$$k_2 = \frac{1}{\kappa^2}, \qquad k_1 = -\frac{2m}{\kappa^2}, \tag{18.14}$$

and also

$$d\, e^{-k_0} = \frac{1}{\sqrt{\kappa^2 \pi}}\, e^{-\frac{m^2}{\kappa^2}}. \tag{18.15}$$


Proof. Equation (18.12) to Equation (18.13):
Suppose that $g$ is given by equation (18.12).

Expanding the square in the exponent, we can rewrite $\frac{(x-m)^2}{\kappa^2}$ as

$$k_2 x^2 + k_1 x + k_0,$$

where equation (18.14) holds and also

$$k_0 = \frac{m^2}{\kappa^2}. \tag{18.16}$$

Thus equation (18.13) holds with $d = \frac{1}{\sqrt{\kappa^2 \pi}}$, and equation (18.15) holds.


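Equation (18.13) to Equation (18.12):

If $g$ is given by equation (18.13), we will obtain an equation for $g$ which
is similar to equation (18.12), by “completing the square”. (Readers have
likely seen such manipulations before. Appendix H recalls that procedure.)

Let

$$\kappa = \frac{1}{\sqrt{k_2}},$$

and let

$$m = -\frac{k_1}{2 k_2}.$$

Then expanding $(x - m)^2$ shows that

$$\frac{(x-m)^2}{\kappa^2} = k_2 x^2 + k_1 x + \frac{m^2}{\kappa^2} = k_2 x^2 + k_1 x + \frac{k_1^2}{4 k_2} = k_2 x^2 + k_1 x + k_0 + \left(\frac{k_1^2}{4 k_2} - k_0\right).$$

Hence

$$e^{-(k_2 x^2 + k_1 x + k_0)} = e^{\frac{k_1^2}{4 k_2} - k_0}\, e^{-\frac{(x-m)^2}{\kappa^2}},$$

and so

$$d\, e^{-(k_2 x^2 + k_1 x + k_0)} = d\, e^{\frac{k_1^2}{4 k_2} - k_0}\, e^{-\frac{(x-m)^2}{\kappa^2}}. \tag{18.17}$$

Since $g$ is a density,

$$\int_{-\infty}^{\infty} d\, e^{-(k_2 x^2 + k_1 x + k_0)}\, dx = 1.$$

Since $\frac{1}{\sqrt{\kappa^2 \pi}}\, e^{-\frac{(x-m)^2}{\kappa^2}}$ is a density,

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{\kappa^2 \pi}}\, e^{-\frac{(x-m)^2}{\kappa^2}}\, dx = 1, \quad \text{i.e.} \quad \int_{-\infty}^{\infty} e^{-\frac{(x-m)^2}{\kappa^2}}\, dx = \sqrt{\kappa^2 \pi}.$$

Thus integrating equation (18.17) shows that equation (18.15) holds. Since

$$d\, e^{\frac{k_1^2}{4 k_2} - k_0} = \frac{1}{\sqrt{\kappa^2 \pi}},$$

equation (18.17) shows that $g$ satisfies equation (18.12).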
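The completing-the-square step can be checked numerically. The Python sketch below takes coefficients $k_2 > 0$, $k_1$, $k_0$, recovers $\kappa = 1/\sqrt{k_2}$ and $m = -k_1/(2k_2)$, and computes the constant $d$ forced by normalization, namely $d = e^{k_0 - m^2/\kappa^2}/\sqrt{\kappa^2\pi}$. The function names and the sample coefficients are hypothetical, chosen only for illustration.

```python
import math

def complete_the_square(k2, k1, k0):
    """For the exponent k2*x^2 + k1*x + k0 with k2 > 0, return (kappa, m, d),
    where d is the constant that makes d*exp(-(k2 x^2 + k1 x + k0)) a density."""
    if k2 <= 0:
        raise ValueError("k2 must be positive")
    kappa = 1.0 / math.sqrt(k2)
    m = -k1 / (2.0 * k2)
    d = math.exp(k0 - m * m / kappa ** 2) / math.sqrt(kappa ** 2 * math.pi)
    return kappa, m, d

def g_normal_form(x, kappa, m):
    """The normal density written with kappa and m."""
    return math.exp(-((x - m) ** 2) / kappa ** 2) / math.sqrt(kappa ** 2 * math.pi)

def g_quadratic_form(x, d, k2, k1, k0):
    """The same density written as d * exp(-(k2 x^2 + k1 x + k0))."""
    return d * math.exp(-(k2 * x * x + k1 * x + k0))

k2, k1, k0 = 0.25, -1.5, 4.0      # arbitrary illustrative coefficients
kappa, m, d = complete_the_square(k2, k1, k0)
print(kappa, m)                   # here kappa = 2 and m = 3

for x in (-2.0, 0.0, 3.0, 7.5):
    # The two columns should agree up to rounding error.
    print(x, g_normal_form(x, kappa, m), g_quadratic_form(x, d, k2, k1, k0))
```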


Example 18.9 (Identifying densities using the variable parts). Let $h$
be a probability density given by $h(x) = d\, e^{-(a x^2 + b x + c)}$ for some constants
$d, a, b, c$.
Let $h_1$ be a probability density given by $h_1(x) = d_1 e^{-(a_1 x^2 + b_1 x + c_1)}$ for some
constants $d_1, a_1, b_1, c_1$.
Suppose that $a_1 = a$ and $b_1 = b$. We will show that then

$$d_1 e^{-c_1} = d\, e^{-c}, \text{ and so } h_1 = h. \tag{18.18}$$

Indeed, since $\int h = 1 = \int h_1$,

$$\int d\, e^{-(a x^2 + b x + c)}\, dx = \int d_1 e^{-(a x^2 + b x + c_1)}\, dx.$$

That is,

$$d\, e^{-c} \int e^{-(a x^2 + b x)}\, dx = d_1 e^{-c_1} \int e^{-(a x^2 + b x)}\, dx,$$

and the two integrals in this equation are identical, so equation (18.18) holds.

Remark 18.10 (Absorbing a constant). Suppose that a probability
density $h$ is written as:

$$h(x) = d\, e^{-(a x^2 + b x + c)}, \tag{18.19}$$

where $d, a, b, c$ are constants. As we have noticed, we can write

$$h(x) = r\, e^{-(a x^2 + b x)},$$

where $r = d\, e^{-c}$. In this situation one sometimes says that we have
absorbed the constant $c$ into the constant $r$.


We have deliberately written several different-looking formulas for normal 
densities, because normal densities may be encountered in such forms. But 
there is a “best” way to write any normal density, as follows. 


Lemma 18.11 (Mean and variance: the best form of a normal density).
Suppose that the distribution of $X$ has a normal probability density $g$.
Then $g$ can be written as

$$g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \tag{18.20}$$

where

$$\mathrm{Var}(X) = \sigma^2, \qquad E[X] = \mu. \tag{18.21}$$

Any normal density is completely determined by its mean and variance.
If $W$ is a normal random variable such that $\mathrm{Var}(W) = t\, \mathrm{Var}(X)$ and
$E[W] = \sqrt{t}\, E[X] + \nu$, then

$$W \text{ and } \sqrt{t}\, X + \nu \text{ have identical distributions.} \tag{18.22}$$


Proof. By Corollary 18.6, with $\kappa = \sqrt{2}\,\sigma$ and $m = \mu$, we have $E[X] = \mu$ and

$$\mathrm{Var}(X) = \frac{\kappa^2}{2} = \sigma^2.$$

These facts give equation (18.21).
Equation (18.21) determines $\sigma^2$ in terms of $\mathrm{Var}(X)$ and $\mu$ in terms of
$E[X]$. So to show that $W$ and $\sqrt{t}\, X + \nu$ have identical distributions, it is
sufficient to check that these two random variables (both of which are normal,
by Remark 18.7) have the same variance and the same mean.
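To see concretely that the mean-and-variance form and the $\kappa, m$ form describe the same functions, here is a small Python sketch. It evaluates both formulas with $\kappa = \sqrt{2}\,\sigma$ and $m = \mu$, as in the proof above. The function names and the sample values of $\mu$ and $\sigma^2$ are assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Normal density in terms of the mean mu and the variance sigma2 > 0."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def normal_pdf_kappa(x, kappa, m):
    """The same family of densities written with kappa and m."""
    return math.exp(-((x - m) ** 2) / kappa ** 2) / math.sqrt(kappa ** 2 * math.pi)

mu, sigma2 = 1.5, 4.0                 # illustrative mean and variance
kappa = math.sqrt(2.0 * sigma2)       # kappa = sqrt(2) * sigma

for x in (-3.0, 0.0, 1.5, 6.0):
    # The two columns should agree up to rounding error.
    print(x, normal_pdf(x, mu, sigma2), normal_pdf_kappa(x, kappa, mu))
```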


It follows from Remark 18.7 that all normal random variables can be
obtained from any one normal random variable by scaling and shifting. So 
there is no reason to think of the distribution of any normal random vari- 
able as being special! However, we will pick one normal distribution to be 
“standard”. 


Definition 18.12 (The standard normal). Let $Z$ be a random variable
with probability density $\eta$ given by

$$\eta(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}. \tag{18.23}$$

The mean and variance of $Z$ are particularly simple: $E[Z] = 0$, $\mathrm{Var}(Z) = 1$.
For this reason, $Z$ is given a special title, and is said to be a standard normal
random variable. The density $\eta$ is called the standard normal density.


The graphs of all normal densities have a similar shape. For what it’s
worth, a graph of the standard normal density is shown in Figure 18.8.

Notice that all derivatives of $\eta$ exist, so the smooth appearance of the
graph in Figure 18.8 is not an illusion.


Exercise 18.3 (Standardizing a random variable). Let $X$ be a random
variable, with $\mathrm{Var}(X) = \sigma^2$ and $E[X] = \mu$.
Let

$$Z = \frac{X - \mu}{\sigma}. \tag{18.24}$$

Check that $\mathrm{Var}(Z) = 1$ and $E[Z] = 0$.
Thus whenever $X$ is normal, $(X - \mu)/\sigma$ is standard normal!

Figure 18.8: The standard normal density $\eta$.


Of course, whenever equation (18.24) holds, we have

$$X = \sigma Z + \mu, \tag{18.25}$$

which can be convenient.


Example 18.13 (Scaled probabilities for normal deviations). Let $X$
be any normal random variable, with mean $\mu$ and variance $\sigma^2$. Then $X$ and
$\sigma Z + \mu$ have the same distribution. Hence for any interval $[a, b]$,

$$\begin{aligned}
P(X \in [a, b]) &= P(\sigma Z + \mu \in [a, b]) \\
&= P(a \le \sigma Z + \mu \le b) \\
&= P\left(Z \in \left[\frac{a - \mu}{\sigma}, \frac{b - \mu}{\sigma}\right]\right).
\end{aligned} \tag{18.26}$$

In particular, for any $c > 0$, taking $a = \mu - c\sigma$ and $b = \mu + c\sigma$, and then
passing to the complementary events, we have

$$P(|X - \mu| > c\sigma) = P(|Z| > c). \tag{18.27}$$

Thus for any $\sigma$, we can conveniently calculate the probability of a deviation
by $X$ from its mean by measuring the size of the deviation of $X$ from its
mean in units of $\sigma$.
For example,

$$\begin{aligned}
P(|X - \mu| > \sigma) &\approx 0.31731050786291415, \\
P(|X - \mu| > 2\sigma) &\approx 0.04550026389635839, \\
P(|X - \mu| > 3\sigma) &\approx 0.0026997960632601866.
\end{aligned} \tag{18.28}$$


Thus a deviation by one standard deviation is not uncommon, while a 
deviation by three standard deviations is rare. 
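Values like these can be reproduced with the complementary error function, using the standard identity $P(|Z| > c) = \operatorname{erfc}(c/\sqrt{2})$ for a standard normal $Z$. The Python sketch below is one way to do that. The function name is hypothetical, and this is not necessarily how the numbers above were computed.

```python
import math

def two_sided_tail(c):
    """P(|Z| > c) for a standard normal Z, via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))

for c in (1, 2, 3):
    # By equation (18.27) this is also P(|X - mu| > c * sigma) for any normal X.
    print(c, two_sided_tail(c))
```

The printed values should agree with equation (18.28) to many decimal places.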


Example 18.14. Let $Z$ be standard normal. We will derive the following
facts.

(i)
$$E[|Z|] = \sqrt{\frac{2}{\pi}}. \tag{18.29}$$

(ii) For each nonnegative integer $n$, if $E[|Z|^n]$ exists, then $E[|Z|^{n+2}]$ exists,
and
$$E[|Z|^{n+2}] = (n + 1)\, E[|Z|^n]. \tag{18.30}$$
(Since $E[|Z|^0] = E[1] = 1$, equation (18.30) tells us again that the
variance of a standard normal random variable is equal to 1.)

(iii) $E[|Z|^n]$ exists for each nonnegative integer $n$.

Proof of (i). Let $Y$ be the random variable defined in Exercise 18.1. By
that exercise, $\mathrm{Var}(Y) = 1/2$, $E[Y] = 0$, $E[|Y|]$ exists and $E[|Y|] = \frac{1}{\sqrt{\pi}}$.
By Exercise 18.3, $\sqrt{2}\, Y$ is a standard normal random variable. Hence
$E[|Z|] = \sqrt{2}\, E[|Y|] = \sqrt{\frac{2}{\pi}}$.

Proof of (ii). For any nonnegative integer $n$, assume that $E[|Z|^n]$ exists.
We will show that $E[|Z|^{n+2}]$ exists, and equation (18.30) holds.
Note that

$$E[|Z|^n] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} |t|^n e^{-t^2/2}\, dt = \frac{2}{\sqrt{2\pi}} \int_0^{\infty} t^n e^{-t^2/2}\, dt.$$

Similarly

$$E[|Z|^{n+2}] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} |t|^{n+2} e^{-t^2/2}\, dt = \frac{2}{\sqrt{2\pi}} \int_0^{\infty} t^{n+2} e^{-t^2/2}\, dt,$$

in the sense that the expected value exists if the integral exists, and they are
equal.
For any $b > 0$, using integration by parts we have

$$\begin{aligned}
\frac{1}{\sqrt{2\pi}} \int_0^b t^{n+2} e^{-t^2/2}\, dt
&= -\frac{1}{\sqrt{2\pi}}\, t^{n+1} e^{-t^2/2}\, \Big|_0^b + \frac{n+1}{\sqrt{2\pi}} \int_0^b t^n e^{-t^2/2}\, dt \\
&= -\frac{1}{\sqrt{2\pi}}\, b^{n+1} e^{-b^2/2} + (n+1)\, \frac{1}{\sqrt{2\pi}} \int_0^b t^n e^{-t^2/2}\, dt.
\end{aligned} \tag{18.31}$$

By L’Hôpital’s rule, $\lim_{b \to \infty} b^{n+1} e^{-b^2/2} = 0$. Also,

$$\lim_{b \to \infty} \int_0^b t^n e^{-t^2/2}\, dt = \int_0^{\infty} t^n e^{-t^2/2}\, dt.$$

Letting $b \to \infty$ in equation (18.31),

$$\lim_{b \to \infty} \frac{1}{\sqrt{2\pi}} \int_0^b t^{n+2} e^{-t^2/2}\, dt = (n+1)\, \frac{1}{\sqrt{2\pi}} \int_0^{\infty} t^n e^{-t^2/2}\, dt,$$

which gives equation (18.30).

Proof of (iii). Since $E[|Z|^0] = E[1] = 1$ exists, using equation (18.30)
repeatedly shows that $E[|Z|^n]$ exists for every even nonnegative integer $n$.

Since $E[|Z|] = \sqrt{\frac{2}{\pi}}$, using equation (18.30) repeatedly shows that $E[|Z|^n]$
exists for every odd nonnegative integer $n$.
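The recursion in (18.30), started from $E[|Z|^0] = 1$ and $E[|Z|] = \sqrt{2/\pi}$, gives every absolute moment of $Z$. The Python sketch below computes the moments this way and compares them with a direct midpoint-rule integration. The function names, the cutoff `upper`, and the number of steps are illustrative assumptions.

```python
import math

def abs_moment(n):
    """E[|Z|^n] for a standard normal Z, from (18.29) and the recursion (18.30)."""
    if n < 0:
        raise ValueError("n must be a nonnegative integer")
    # Start from E|Z|^0 = 1 (even n) or E|Z|^1 = sqrt(2/pi) (odd n).
    value = 1.0 if n % 2 == 0 else math.sqrt(2.0 / math.pi)
    for k in range(n % 2, n, 2):
        value *= k + 1          # E|Z|^(k+2) = (k + 1) * E|Z|^k
    return value

def abs_moment_numeric(n, upper=40.0, steps=100_000):
    """Midpoint-rule value of (2/sqrt(2 pi)) * integral_0^upper t^n e^(-t^2/2) dt."""
    width = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * width
        total += t ** n * math.exp(-0.5 * t * t)
    return 2.0 / math.sqrt(2.0 * math.pi) * width * total

for n in range(6):
    # The two columns should agree to several decimal places.
    print(n, abs_moment(n), abs_moment_numeric(n))
```

For even $n$ the recursion gives the familiar values $E[Z^2] = 1$ and $E[Z^4] = 3$.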


Exercise 18.4. Let $Z$ be standard normal. Show that for any odd positive
integer $n$, $E[Z^n] = 0$.


Exercise 18.5 (Testing Chebyshev with a normal random variable).
Let $X$ be a normal random variable with mean $\mu$ and standard deviation $\sigma$.

(i) Find the probability density for the random variable $Z$ defined by

$$Z = \frac{X - \mu}{\sigma}.$$

(ii) Show that for any $c > 1$,

$$P(|X - \mu| \ge c\sigma) < \frac{2}{\sqrt{2\pi}}\, e^{-\frac{c^2}{2}}. \tag{18.32}$$

(Hint: convert equation (18.32) into an inequality involving an integral
of the density of $Z$. Then notice that for $z > 1$, $e^{-\frac{1}{2} z^2} < z\, e^{-\frac{1}{2} z^2}$.
This trivial fact can be used to get something that we know how to
integrate.)

(iii) Show that equation (18.32) gives a much sharper inequality than
Chebyshev (equation (16.24)), for large $c$.


18.6 The Central Limit Theorem 


Suppose that a random variable $S_n$ is equal to the sum of $n$ independent
random variables, where $n$ is a large number. The Central Limit Theorem
says that under the right circumstances, a normal distribution can be used as
a good approximation to the distribution of $S_n$. This type of approximation
is suggested by the plots we made of various binomial distributions in an
earlier section. The actual statement of the Central Limit Theorem makes a giant
leap in generality from those binomial examples. We’ll state this theorem
carefully and then show more examples.
First we need a simple definition. 


Definition 18.15 (Identically distributed random variables).

Let $X_1, \dots, X_n$ be random variables defined for some probability model.
If all the random variables $X_i$ have the same distribution, we say that the
random variables in the sequence are identically distributed.

A sequence of random variables which is both independent and identically
distributed is said to be an IID sequence of random variables.

The random variables $X_i$ introduced earlier, which give the results of a
sequence of coin tosses, are a typical example of an IID sequence.

The idea of the Central Limit Theorem is sometimes expressed by saying
that any physical random variable whose value is the sum of “many small
independent effects” should have a distribution which is similar to a normal
distribution.

A mathematical special case of “many independent effects” is an IID
sequence $X_1, \dots, X_n$. We assume that $E[X_i]$ and $\mathrm{Var}(X_i)$ exist. The basic
form of the Central Limit Theorem says that for such a sequence, if $n$ is large
enough then the random variable $S_n = X_1 + \dots + X_n$ has a distribution which
is similar to a normal distribution. That is, if $n$ is large enough, and if $W_n$
is a normal random variable with the same mean and the same variance
as $S_n$, then for every interval $J$ we have

$$P(S_n \in J) \approx P(W_n \in J). \tag{18.33}$$
The formal statement is as follows. 


Theorem 18.16 (The Central Limit Theorem). Let $X_1, \dots, X_n$ be an
IID sequence of random variables. Let $S_n = X_1 + \dots + X_n$.

Suppose that each $X_i$ has mean $\mu$ and has variance $\sigma^2 > 0$. By additivity
of expectation, the mean of $S_n$ is $n\mu$, and since the $X_i$ are independent,
$\mathrm{Var}(S_n) = n\sigma^2$.

Let $W_n$ be a normal random variable such that the mean and variance of
$W_n$ are the same as the mean and variance of $S_n$.

For any $\varepsilon > 0$, there exists $n_0$ such that for all $n > n_0$,

$$|P(S_n \in J) - P(W_n \in J)| < \varepsilon, \tag{18.34}$$

for all intervals $J$ (including both finite intervals $J$ and infinite intervals $J$).


A proof will not be given for the Central Limit Theorem. Figure 18.9
suggests why it might be true.

The Central Limit Theorem is sometimes referred to briefly as the CLT.
It should be mentioned that the CLT can be expressed in various ways,
and readers may find different-looking statements in other books. We’ll also
restate the theorem later ourselves, in Theorem 18.26.

Figure 18.9: Comparing the binomial distribution (p = .5, n = 1000) with
the normal density having the same mean and variance (mean = np, variance
= np(1 − p)).
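The comparison in Figure 18.9 can also be made numerically. The Python sketch below computes the exact binomial CDF for $p = 1/2$, $n = 1000$ at every integer $k$ and records the largest gap from the CDF of the normal $W_n$ with the same mean and variance. It checks only intervals of the form $(-\infty, k]$, and the variable names and the use of `statistics.NormalDist` are choices made for this illustration.

```python
import math
from statistics import NormalDist

n, p = 1000, 0.5
mean, variance = n * p, n * p * (1 - p)
w = NormalDist(mu=mean, sigma=math.sqrt(variance))   # plays the role of W_n

# Exact binomial CDF using integer arithmetic. This shortcut relies on
# p = 1/2, for which P(S_n = k) = C(n, k) / 2^n.
cumulative = 0
max_gap = 0.0
for k in range(n + 1):
    cumulative += math.comb(n, k)
    exact = cumulative / 2 ** n          # P(S_n <= k)
    max_gap = max(max_gap, abs(exact - w.cdf(k)))

print(max_gap)   # largest |P(S_n <= k) - P(W_n <= k)| over integers k
```

The gap is small but not negligible, because the binomial CDF jumps at each integer while the normal CDF is continuous.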


Remark 18.17 (Approximation, but no rate of convergence). We
know how to compute numerical probabilities using normal densities, so
Theorem 18.16 shows us how to compute an approximation for $P(S_n \in J)$ which
is close to within an error bound of $\varepsilon$, for every interval $J$.

The approximation holds for every interval $J$ for the same $n$, if $n$ is large
enough. But the statement of this theorem does not tell us how large $n$ must
be.

So, despite its formality, equation (18.34) is not more specific than
equation (18.33).

From the standpoint of applications, we tend to expect such limitations. 
General mathematical theorems tell us what sort of behavior to look for, but 
precise error estimates may not be easy to come by. 

One theoretical estimate is given by the Berry-Esseen Theorem, which
says that (under slightly more restrictive conditions than Theorem 18.16)
there exists a constant $C$ such that if $W_n$ is normal and has the same mean
and variance as $S_n$, then

$$|P(W_n \in J) - P(S_n \in J)| \le \frac{C}{\sqrt{n}}$$

for every interval $J$. But we won’t take time to present such results.

Remark 18.18 (Do we need identically distributed random variables
in the CLT?). In Theorem 18.16 it is assumed that the random
variables $X_1, X_2, \dots$ all have the same distribution. This is convenient as a
simple mathematical assumption, but it is not necessary.

Remember that it was suggested earlier that any physical random vari- 
able whose value is the sum of many small independent effects should have 
a distribution which is similar to a normal distribution. In a real-world situ- 
ation, it doesn’t seem natural that many independent effects would all have 
the same probability distribution. 

There are more general forms of Theorem 18.16 in which the random
variables $X_1, X_2, \dots$ are independent but do not all have the same distribu-
tion. In this situation one must impose an extra mathematical condition to 
obtain the result of the Central Limit Theorem. Very roughly, the idea is 
that the sum for S,, should be made up of terms which are comparable in 
size. 

We won’t pursue this topic mathematically, but such theorems make us 
more confident that the approximation described in the Central Limit The- 
orem is valid in many situations. 


Example 18.19 (A basic approximation example). Let $X_1, \dots, X_{10000}$
be a sequence of independent random variables.

Suppose that the distribution of each $X_i$ has a probability density $f$,
where $f(x) = c x^2$ on the interval $[-1, 2]$, and $f$ is zero everywhere on the
complement of $[-1, 2]$.

The constant $c$ is of course determined by the fact that $\int f = 1$.

Let $n = 10000$. Our goal in this example is to find the approximate value
of $P(S_n < 12600)$.

The Central Limit Theorem suggests a way to do this, by means of a
normal random variable $W_n$ with the same mean and variance as $S_n$.

So we need to find the mean and variance of $S_n$.


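Since $\int f = 1$,

$$1 = \int_{-1}^{2} c x^2\, dx = \frac{c x^3}{3}\, \Big|_{-1}^{2} = \frac{8c}{3} + \frac{c}{3} = 3c.$$

Thus $c = 1/3$.
Then

$$E[X_i] = \int_{-1}^{2} x \cdot c x^2\, dx = \frac{c x^4}{4}\, \Big|_{-1}^{2} = \frac{16c}{4} - \frac{c}{4} = \frac{15c}{4} = \frac{5}{4}.$$

Also

$$E[X_i^2] = \int_{-1}^{2} x^2 \cdot c x^2\, dx = \frac{c x^5}{5}\, \Big|_{-1}^{2} = \frac{32c}{5} + \frac{c}{5} = \frac{33c}{5} = \frac{11}{5}.$$

Thus

$$\mathrm{Var}(X_i) = \frac{11}{5} - \left(\frac{5}{4}\right)^2 = \frac{11}{5} - \frac{25}{16} = \frac{176 - 125}{80} = \frac{51}{80}.$$

Thus, for any $n$, $\mathrm{Var}(S_n) = 51n/80$, while $E[S_n] = 5n/4$.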

Now we are ready to use the Central Limit Theorem. 

For any $n$, let $W_n$ have a normal distribution, with $\mathrm{Var}(W_n) = 51n/80$
and $E[W_n] = 5n/4$.

Because of the Central Limit Theorem, we know that as $n$ becomes large,
the distributions of $W_n$ and $S_n$ become similar.

In particular, when $n$ is large, $P(W_n < a)$ and $P(S_n < a)$ are close, for
all $a$.

We will try to use this approximation when $n = 10000$, and hope that

$$P(S_n < 12600) \approx P(W_n < 12600).$$

We can calculate $P(W_n < 12600)$ using a computer program.

Some programs have a predefined function for this purpose, but we won’t 
assume that we have such a program. Instead, we’ll work with the normal 
density and then ask a computer to perform a routine numerical integration. 

Let $h$ be a probability density for the distribution of $W_n$, with $n = 10000$.
Then

$$P(W_n < 12600) = \int_{-\infty}^{12600} h.$$

What is the formula for $h$?
Let $m = E[W_n] = E[S_n] = 12500$, and let $v = \mathrm{Var}(W_n) = \mathrm{Var}(S_n) =
10000 \cdot (51/80) = 125 \cdot 51$. By equation (18.20) of Lemma 18.11, a valid density $h$ is given by

$$h(u) = \frac{1}{\sqrt{v}\sqrt{2\pi}}\, e^{-(u - m)^2/(2v)}.$$

Using a computer program to perform the numerical integration, we find

$$\int_{-\infty}^{12600} h = \int_{-\infty}^{12600} \frac{1}{\sqrt{125 \cdot 51}\sqrt{2\pi}}\, e^{-(u - 12500)^2/(2 \cdot 125 \cdot 51)}\, du = 0.8947967735713137.$$

So that’s our approximation:

$$P(S_n < 12600) \approx 0.8947967735713137. \tag{18.35}$$
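The whole calculation in this example can be reproduced in a few lines. The Python sketch below computes $c$, $E[X_i]$ and $\mathrm{Var}(X_i)$ exactly with rational arithmetic, then evaluates $P(W_n < 12600)$. Note that it uses the predefined normal CDF in `statistics.NormalDist` instead of the numerical integration described above, and that the helper name `power_integral` is hypothetical.

```python
import math
from fractions import Fraction
from statistics import NormalDist

def power_integral(k, lo=-1, hi=2):
    """Exact integral of x^k over [lo, hi]."""
    return Fraction(hi ** (k + 1) - lo ** (k + 1), k + 1)

c = 1 / power_integral(2)           # normalization: c * integral of x^2 equals 1
mean_x = c * power_integral(3)      # E[X_i]   = c * integral of x^3
second_x = c * power_integral(4)    # E[X_i^2] = c * integral of x^4
var_x = second_x - mean_x ** 2

n = 10_000
w = NormalDist(mu=float(n * mean_x), sigma=math.sqrt(float(n * var_x)))
approx = w.cdf(12_600)              # P(W_n < 12600)

print(c, mean_x, var_x)             # 1/3 5/4 51/80
print(approx)                       # close to the value in equation (18.35)
```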
See Figure 18.10. $P(W_n < 12600)$ is the area of the shaded region under the
graph. Note that this picture is heavily scaled. If we used the same scale on
the vertical and horizontal axes, then the graph would be extremely low in
comparison to its width!
See Remark 18.29 for further discussion of scaling normal densities.

Figure 18.10: Graph of the density of $W_n$, showing $P(W_n < 12600)$


Checking the answer 


How accurate is the approximation in equation (18.35)? We are not going to
discuss theoretical upper bounds for the error. But we’re going to carry out
a numerical check.

Using a computer, we will simulate the independent sequence $X_1, \dots, X_n$,
with $n = 10000$, getting a sequence of values $v_1, \dots, v_n$. The sum $v_1 + \dots + v_n$
is a sample value for $S_n$.

Call this sample value $s_n$. If $s_n < 12600$, we will say that the event
$\{S_n < 12600\}$ occurred in the simulation.

We’ll ask the computer to perform that whole procedure 1000000 times!

The fraction of those times that $\{S_n < 12600\}$ occurs will be an
estimate for $P(S_n < 12600)$, based on the frequency interpretation of probability.

We can compare this estimate with the one given in equation (18.35).

Doing these simulations sounds like a lot of work. But the work is done 
by the computer, not us. It does take some time. 

Incidentally, Section I.4 discusses an interesting transformation that makes
the task easier for the computer. 

In any case this simulation procedure is slower than using the Central 
Limit Theorem. There seems to be a trade-off. We have to think harder 
about concepts in order to use the Central Limit Theorem, but there is less 
computational work. 

Doing the simulation 1000000 times, and calculating the frequency, gives
the following estimate:

$$P(S_n < 12600) \approx 0.894881.$$


Is that better or worse than the estimate in equation (18.35)? Your author 
doesn’t know. But at least the two estimates seem consistent. 
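A scaled-down version of this simulation is sketched below in Python. It is not the program used for the estimate above. It draws each $X_i$ by inverting the CDF $F(x) = (x^3 + 1)/9$ of the density $x^2/3$ on $[-1, 2]$, which is one standard sampling method and an assumption here, and it uses only 200 repetitions instead of 1000000 so that it finishes quickly. The function names, the seed, and the repetition count are all illustrative.

```python
import math
import random

def sample_x(rng):
    """Draw one value with density x^2/3 on [-1, 2] by inverting the CDF
    F(x) = (x^3 + 1)/9, that is, x = cube root of (9u - 1) for uniform u."""
    v = 9.0 * rng.random() - 1.0
    return math.copysign(abs(v) ** (1.0 / 3.0), v)   # real cube root of v

def estimate(n=10_000, repetitions=200, threshold=12_600.0, seed=0):
    """Fraction of simulated sums s_n that fall below the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        s_n = sum(sample_x(rng) for _ in range(n))
        if s_n < threshold:
            hits += 1
    return hits / repetitions

est = estimate()
print(est)   # a rough estimate of P(S_n < 12600)
```

With only 200 repetitions the standard error of the estimate is about 0.02, so agreement with equation (18.35) can be expected only to one or two decimal places. Increasing `repetitions` tightens the estimate at the cost of running time.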


We have focused here on a numerical example as a way of understanding 
the statement of the Central Limit Theorem. But it should be noted that the 
Central Limit Theorem is not just a tool for obtaining approximations. It 
also helps us to understand the general behavior of sums of random variables. 
In modern probability theory that is likely the most important application 
of this theorem. 


Exercise 18.6. Let $X_1, \dots, X_{10000}$ be a sequence of independent random
variables. For each i, P(X = —1) = 3, P(X = 2) = 3, P(X =5) =. 
Find a reasonable approximation to the value of $P(S_{10000} < 5200)$.
Your final answer can be in the form of the integral of an explicitly given
function.


A slightly different formulation of the Central Limit Theorem will be given
in Theorem 18.26. Before that theorem is stated, the next section introduces
some convenient terminology for describing probability distributions on the 
real line. 


18.7 Cumulative distribution functions 


Definition 18.20 (The cumulative distribution function of a random
variable). Let $X$ be a random variable. The cumulative distribution
function $F_X$ for $X$ is the function on the real line defined by

$$F_X(a) = P(X \le a). \tag{18.36}$$

Often we refer to a cumulative distribution function simply as a
“distribution function”, or use the acronym CDF.

Recall that the tail of the distribution of $X$ is defined as $P(X > t)$,
considered as a function of $t$ (Definition 14.12). Thus the tail function is
equal to $1 - F_X$, and contains exactly the same information as $F_X$.


Example 18.21.

Let $X$ be a random variable whose distribution has a density which is
uniform on $[a, b]$ and zero on the rest of the real line.

For any $t$, $F_X(t) = P(X \le t)$. Thus

$$F_X(t) = \begin{cases} 0 & \text{if } t < a, \\ \dfrac{t - a}{b - a} & \text{if } a \le t \le b, \\ 1 & \text{if } t > b. \end{cases}$$

Figure 18.11 shows the graph of $F_X$ for the case $a = 2$, $b = 7$.

Figure 18.11: CDF for $X$ when the distribution of $X$ is uniform on $[2, 7]$


Example 18.22.
Let $Z$ be a standard normal random variable. Then

$$F_Z(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz.$$

$F_Z$ increases over the whole real line, with $\lim_{t \to -\infty} F_Z(t) = 0$ and
$\lim_{t \to \infty} F_Z(t) = 1$. See Figure 18.12.


Suppose that (in the days before computers) you wanted to prepare a
table listing the CDF of a random variable $X$. You can’t list the value of
$F_X(a)$ for every real number $a$, but at least you could list a representative set
of values, so that users could find approximations to what they need. That’s
the way mathematical tables work.

But could you prepare a table which approximately listed the whole
distribution of $X$, in the same way? Your list would have to show the approximate
value of $P(X \in S)$ for every subset $S$ of $\mathbb{R}$. This seems hopelessly difficult.

Even preparing a table listing $P(X \in (a, b])$ for all intervals $(a, b]$ seems
unbearable. However, the next exercise shows that you can at least avoid
that task.

Figure 18.12: CDF for the standard normal


Exercise 18.7 (Interval probabilities from the CDF). Show that the
following statement follows from the definition of a CDF.
Let $X$ be a random variable. For any points $a, b$ in $\mathbb{R}$ with $a < b$,

$$P(a < X \le b) = F_X(b) - F_X(a). \tag{18.37}$$


It is interesting that the CDF of a random variable determines the whole 
distribution. That is stated in the next lemma, but the hard part of the 
proof is omitted. 


Lemma 18.23 (Interval probabilities characterize distributions).
The following statements are equivalent.

(i) $X$ and $Y$ have the same probability distribution.
(ii) $P(X \in (a, b]) = P(Y \in (a, b])$ for all intervals $(a, b]$.
(iii) $F_X = F_Y$.

Proof. As usual, we write $\Longrightarrow$ to mean “implies”.

(i) $\Longrightarrow$ (ii) and (i) $\Longrightarrow$ (iii): This is immediate from the definition of
the distribution.

(ii) $\Longrightarrow$ (i): This part of the proof is omitted! The ideas are not deep but
require technicalities from real analysis.

(iii) $\Longrightarrow$ (ii): This follows from equation (18.37).


We can use cumulative distribution functions to express the Central Limit 
Theorem in a convenient form. One more definition will also be helpful for 
that particular purpose. 


Definition 18.24. For any random variable $X$, define the function $G_X$ on
$\mathbb{R}$ by

$$G_X(a) = P(X < a). \tag{18.38}$$

Since $\{X \le a\} = \{X < a\} \cup \{X = a\}$, we see by additivity that

$$F_X(a) = G_X(a) + P(X = a).$$

So the $G_X$ notation is not really needed, it’s just convenient here.
If it happens that there is a density $f$ for the distribution of $X$, then

$$G_X(a) = \int_{-\infty}^{a} f = F_X(a),$$

so $F_X$ and $G_X$ are the same function.
For general distributions, it is not hard to show that we always have
$G_X(a) = F_X(a)$ at any point $a$ where $F_X$ is continuous.
The notation in Definition 18.24 is not standard, but it’s convenient for
discussing the CLT, because $F_X$ and $G_X$ together allow us to describe
probabilities $P(X \in J)$ briefly, for any finite or infinite interval $J$. The following
lists all the cases, probably in excessive detail.


$$\begin{aligned}
P(X \in (a, b]) &= F_X(b) - F_X(a), \\
P(X \in [a, b)) &= G_X(b) - G_X(a), \\
P(X \in (a, b)) &= G_X(b) - F_X(a), \\
P(X \in [a, b]) &= F_X(b) - G_X(a), \\
P(X \in (-\infty, b]) &= F_X(b), \\
P(X \in (-\infty, b)) &= G_X(b), \\
P(X \in (a, \infty)) &= 1 - F_X(a), \\
P(X \in [a, \infty)) &= 1 - G_X(a).
\end{aligned} \tag{18.39}$$

All the equalities in equation (18.39) are easy consequences of the definitions,
just as in the proof of equation (18.37).
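The difference between $F_X$ and $G_X$ only matters when $X$ has point masses. The Python sketch below illustrates the four finite-interval cases for a fair six-sided die, comparing each formula with a direct count over the probability mass function. The die, the endpoints, and the function names are assumptions chosen for the illustration.

```python
from fractions import Fraction

# A fair six-sided die: P(X = k) = 1/6 for k = 1, ..., 6 (illustrative choice).
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def F(a):
    """F_X(a) = P(X <= a)."""
    return sum(p for k, p in pmf.items() if k <= a)

def G(a):
    """G_X(a) = P(X < a)."""
    return sum(p for k, p in pmf.items() if k < a)

def prob(lo, hi, lo_closed, hi_closed):
    """P(X in the interval), computed directly from the probability mass function."""
    def inside(k):
        left = k >= lo if lo_closed else k > lo
        right = k <= hi if hi_closed else k < hi
        return left and right
    return sum(p for k, p in pmf.items() if inside(k))

a, b = 2, 5
checks = {
    "(a, b]": (prob(a, b, False, True), F(b) - F(a)),
    "[a, b)": (prob(a, b, True, False), G(b) - G(a)),
    "(a, b)": (prob(a, b, False, False), G(b) - F(a)),
    "[a, b]": (prob(a, b, True, True), F(b) - G(a)),
}
for name, (direct, from_cdf) in checks.items():
    print(name, direct, from_cdf)   # each pair of values should be equal
```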


An appendix has more information about CDF’s and their applications.


18.8 Rephrasing the CLT 


Sometimes people prefer to think about a standard normal random variable, 
rather than a general normal random variable. In the days before computers, 
when books of mathematical tables were common, it was almost essential 
to convert a probability calculation for a normal random variable into a 
probability calculation for a standard normal distribution. Though no longer 
needed for computation, this type of conversion can still clarify our thinking, 
as we will see in Theorem 18.26.


Remark 18.25 (Standardizing random variables and events). Let $X$
be a random variable with mean $\mu$ and standard deviation $\sigma$.
Let

$$Z = \frac{X - \mu}{\sigma}, \quad \text{i.e.} \quad X = \sigma Z + \mu.$$

By Exercise 18.3, $Z$ has mean zero and variance one. We will refer to $Z$ as
the standardized version of $X$.
It is easy to transform an event such as $\{a < X \le b\}$ into a similar event
for $Z$:

$$\{a < X \le b\} = \left\{\frac{a - \mu}{\sigma} < Z \le \frac{b - \mu}{\sigma}\right\}. \tag{18.40}$$

Thus

$$P(a < X \le b) = P\left(\frac{a - \mu}{\sigma} < Z \le \frac{b - \mu}{\sigma}\right). \tag{18.41}$$

And if $X$ happens to be normal, then by Remark 18.7 we know that $Z$
is also normal. Thus by standardizing we can convert any normal random
variable into a standard normal random variable. And we can convert any
probability calculation for $X$ into a probability calculation for $Z$. Sometimes
that is useful.
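As a small illustration of this conversion, the Python sketch below computes $P(a < X \le b)$ for a normal $X$ in two ways: directly, and through the standardized endpoints with a standard normal $Z$. The values of $\mu$, $\sigma$, $a$, $b$ are arbitrary, and `statistics.NormalDist` is simply one convenient source of a normal CDF.

```python
from statistics import NormalDist

mu, sigma = 12.0, 3.0            # illustrative mean and standard deviation
a, b = 10.0, 16.5                # illustrative interval endpoints

x = NormalDist(mu, sigma)
z = NormalDist()                 # standard normal: mean 0, standard deviation 1

direct = x.cdf(b) - x.cdf(a)     # P(a < X <= b)
standardized = z.cdf((b - mu) / sigma) - z.cdf((a - mu) / sigma)

print(direct, standardized)      # the two values agree up to rounding
```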


Our first version of the Central Limit Theorem was Theorem 18.16. Here’s
a version of the Central Limit Theorem with a standardization step built 
in. This theorem is stated using cumulative distribution functions. Using 
cumulative distribution functions lets one talk about functions rather than 
intervals, which may be convenient, but we could easily avoid CDF termi- 
nology. 


Theorem 18.26 (The Central Limit Theorem with standardizing).
Let $X_1, \dots, X_n$ be an IID sequence of random variables. Let $S_n = X_1 + \dots +
X_n$.

Suppose that each $X_i$ has mean $\mu$ and variance $\sigma^2 > 0$. Then the mean
of $S_n$ is $n\mu$, and the variance of $S_n$ is $n\sigma^2$.

Let $Z$ be a standard normal random variable.

Then for any sequence $a_n$ of real numbers,

$$F_{S_n}(a_n) - F_Z\left(\frac{a_n - n\mu}{\sqrt{n}\,\sigma}\right) \to 0 \quad \text{and} \quad G_{S_n}(a_n) - G_Z\left(\frac{a_n - n\mu}{\sqrt{n}\,\sigma}\right) \to 0. \tag{18.42}$$

We use the notations of Definitions 18.20 and 18.24 in equation (18.42).

Of course $F_Z$ and $G_Z$ are the same function, since

$$P(Z < a) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-\frac{z^2}{2}}\, dz = P(Z \le a).$$

Remark 18.27 (Sequences versus $\varepsilon$ statements). In the statement of
Theorem 18.26, equation (18.42) is asserted to hold for any sequence $a_n$.

Readers who are familiar with convergence arguments can easily show
that this assertion is equivalent to saying that for any $\varepsilon > 0$, there exists
some $n_0$ such that for all $n > n_0$, for every real number $a$,

$$\left|F_{S_n}(a) - F_Z\left(\frac{a - n\mu}{\sqrt{n}\,\sigma}\right)\right| < \varepsilon \quad \text{and} \quad \left|G_{S_n}(a) - G_Z\left(\frac{a - n\mu}{\sqrt{n}\,\sigma}\right)\right| < \varepsilon. \tag{18.43}$$

Which formulation is preferable is a matter of taste, but talking about
sequences $a_n$ seems more convenient for our applications.


Proving that the two forms of the CLT are equivalent 


We didn’t prove Theorem 18.16, but it’s not hard to see that Theorem 18.26
and Theorem 18.16 are equivalent.

To show that, let $W_n = \sqrt{n}\,\sigma Z + n\mu$. Then $W_n$ is normal and has the
same mean and variance as $S_n$, and, just as in equation (18.41),

$$P\left(Z \le \frac{a - n\mu}{\sqrt{n}\,\sigma}\right) = P(W_n \le a), \quad \text{and} \quad P\left(Z < \frac{a - n\mu}{\sqrt{n}\,\sigma}\right) = P(W_n < a).$$

Note that

$$F_{S_n}(a) = P(S_n \le a) \quad \text{and} \quad F_Z\left(\frac{a - n\mu}{\sqrt{n}\,\sigma}\right) = P\left(Z \le \frac{a - n\mu}{\sqrt{n}\,\sigma}\right) = P(W_n \le a).$$

Similarly

$$G_{S_n}(a) = P(S_n < a) \quad \text{and} \quad G_Z\left(\frac{a - n\mu}{\sqrt{n}\,\sigma}\right) = P\left(Z < \frac{a - n\mu}{\sqrt{n}\,\sigma}\right) = P(W_n < a).$$

It follows that equation (18.43) just states two special cases of equation
(18.34)!

On the other hand, if equation (18.43) holds for some error $\varepsilon$ then, using
equation (18.39), it’s not hard to show that the inequality in equation (18.34)
holds for every interval $J$ with $\varepsilon$ replaced by $2\varepsilon$.
So the two versions of the Central Limit Theorem imply each other.


Since Theorem 18.16 is equivalent to Theorem 18.26, one might wonder
if there is any benefit in having two versions of the same result.

A minor benefit is that statements of the Central Limit Theorem in other
textbooks may be more similar to Theorem 18.26 than to Theorem 18.16.
More significantly, readers might consider which version of the Central Limit
Theorem can be applied more easily in the following exercise.


Exercise 18.8. In the setting of Theorem 18.26, suppose that the random
variables $X_i$ have mean zero and variance $\sigma^2$. Let $a_n$ be a sequence of points
of $\mathbb{R}$. Find $\lim_{n\to\infty} P(S_n \le a_n)$ in each of the following cases.


(i) $a_n \to 0$.

(ii) $a_n \to 5$.

(iii) $a_n = 2\sqrt{n}$. Your answer in this part should be left as an integral.

(iv) Let $a_n$ be as in part (iii), and let $b_n = 5a_n$. Find

\[ \lim_{n\to\infty} P(S_n \in [a_n, b_n]). \]


Example 18.28. In the setting of Theorem 18.26, let $a_n$ and $b_n$ be any
sequences of numbers. We would like to know whether

\[ P(S_n \le a_n) \approx P(S_n \le b_n) \]

when $n$ is large.
In other words, we would like to know whether or not

\[ \lim_{n\to\infty} \bigl( P(S_n \le a_n) - P(S_n \le b_n) \bigr) = 0. \tag{18.44} \]

Using CDF notation, our question is whether or not

\[ \lim_{n\to\infty} \bigl( F_{S_n}(a_n) - F_{S_n}(b_n) \bigr) = 0. \tag{18.45} \]

Let

\[ u_n = \frac{a_n - n\mu}{\sqrt{n}\,\sigma}, \qquad v_n = \frac{b_n - n\mu}{\sqrt{n}\,\sigma}. \]

By equation (18.42),

\[ F_{S_n}(a_n) - F_Z(u_n) \to 0 \quad\text{and}\quad F_{S_n}(b_n) - F_Z(v_n) \to 0. \]

Hence

\[ F_{S_n}(a_n) = F_Z(u_n) + \text{stuff}, \]

where stuff $\to 0$. Similarly

\[ F_{S_n}(b_n) = F_Z(v_n) + \text{stuff}, \]

where stuff $\to 0$. Thus

\[ F_{S_n}(a_n) - F_{S_n}(b_n) = F_Z(u_n) - F_Z(v_n) + \text{stuff}, \]

where stuff $\to 0$.
It follows that $F_{S_n}(a_n) - F_{S_n}(b_n) \to 0$ if and only if $F_Z(u_n) - F_Z(v_n) \to 0$.
That is, equation (18.45) is equivalent to the statement that

\[ \lim_{n\to\infty} \bigl( F_Z(u_n) - F_Z(v_n) \bigr) = 0. \tag{18.46} \]

To see the conditions under which equation (18.46) will hold, we can
contemplate the graph of $F_Z$, shown in Figure 18.12.

If $u_n \to -\infty$ and $v_n \to -\infty$, then clearly equation (18.46) holds, since
$F_Z(u_n) \to 0$ and $F_Z(v_n) \to 0$. The same is true if $u_n \to \infty$ and $v_n \to \infty$,
since then $F_Z(u_n) \to 1$ and $F_Z(v_n) \to 1$.

Otherwise, the graph of $F_Z$ strongly suggests that equation (18.46) holds
when (and only when) $u_n - v_n \to 0$. This condition is often something that
can be checked.

For example, suppose that $b_n = a_n + 10000000\,n^{1/4}$, so that $b_n - a_n \to \infty$.
It might not be clear whether equation (18.44) holds. However, it is easy to
see that $u_n - v_n = -10000000\,n^{1/4}/(\sqrt{n}\,\sigma) \to 0$, and so equation (18.44)
does indeed hold.
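
The last claim of Example 18.28 can be checked numerically. The following is only a sketch under stated assumptions: it takes $a_n = 0$, $\mu = 0$ and $\sigma = 1$ (none of which are fixed by the example), and the helper names `std_normal_cdf` and `cdf_gap` are made up for this illustration. It shows that $v_n - u_n$ eventually becomes small even though $b_n - a_n \to \infty$, but only for very large $n$.

```python
import math

def std_normal_cdf(x: float) -> float:
    """F_Z(x) for a standard normal Z, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_gap(n: float, a_n: float, mu: float = 0.0, sigma: float = 1.0):
    """Return (v_n - u_n, F_Z(v_n) - F_Z(u_n)) for b_n = a_n + 10000000 * n**(1/4).

    The default mu and sigma are assumptions made for this illustration.
    """
    b_n = a_n + 10_000_000 * n ** 0.25
    scale = math.sqrt(n) * sigma
    u_n = (a_n - n * mu) / scale
    v_n = (b_n - n * mu) / scale
    return v_n - u_n, std_normal_cdf(v_n) - std_normal_cdf(u_n)

for exponent in (20, 30, 40, 50):
    n = 10.0 ** exponent
    width, gap = cdf_gap(n, a_n=0.0)
    print(f"n = 1e{exponent}: v_n - u_n = {width:.3e}, F_Z(v_n) - F_Z(u_n) = {gap:.3e}")
```

The large constant 10000000 means that the approximation $P(S_n \le a_n) \approx P(S_n \le b_n)$ is poor until $n$ is astronomically large, which is consistent with the limit statement: a limit says nothing about how soon the difference becomes small.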
Exercise 18.9. In Example 18.28 we claimed that equation (18.46) holds
when $u_n - v_n \to 0$. This claim is supported by the graph of the CDF. To
justify the claim, show that for any u < v, 


Fz(v) = Fz(u) = 


Exercise 18.10 (Expressing the Central Limit Theorem as a standardized
limit). In the setting of Theorem 18.16 or Theorem 18.26, prove
that for any real number $a$,

\[ \lim_{n\to\infty} P\!\left( \frac{S_n - n\mu}{\sqrt{n}\,\sigma} \le a \right) = P(Z \le a), \tag{18.47} \]

where $Z$ is standard normal.


(Sometimes the Central Limit Theorem is expressed as equation (18.47).
This form is indeed equivalent to Theorem 18.16 and Theorem 18.26, but
seems less convenient for some applications.)


Remark 18.29 (Natural scaling). Equation (18.47) seems natural if we
are trying to picture the distribution of $S_n$ as $n \to \infty$.

For example, by replacing $S_n$ by the centered random variable $S_n - n\mu$, we
are trying to ensure that the distribution stays in the picture. By subtracting
the mean we keep the distribution from drifting off to $\infty$ or $-\infty$ as $n \to \infty$.

However, the distribution of $S_n - n\mu$ still gets wider and wider as $n$ grows,
so it is still escaping from our view. To prevent this spreading we standardize
$S_n - n\mu$, i.e. we divide by $\sqrt{n}\,\sigma$, before stating the limit in equation (18.47).
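
Equation (18.47) can be tested by simulation. The sketch below is only an illustration under assumptions that are not in the text: the $X_i$ are taken to be uniform on $[0, 1]$ (so $\mu = 1/2$ and $\sigma^2 = 1/12$), $n = 30$, $a = 1$, and the helper names are hypothetical.

```python
import math
import random

def std_normal_cdf(x):
    """P(Z <= x) for a standard normal Z."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def standardized_sum_cdf(n, a, trials, rng):
    """Estimate P((S_n - n*mu) / (sqrt(n)*sigma) <= a) for X_i uniform on [0, 1]."""
    mu = 0.5                        # mean of a uniform [0, 1] variable
    sigma = math.sqrt(1.0 / 12.0)   # its standard deviation
    hits = 0
    for _ in range(trials):
        s_n = sum(rng.random() for _ in range(n))
        if (s_n - n * mu) / (math.sqrt(n) * sigma) <= a:
            hits += 1
    return hits / trials

rng = random.Random(0)  # fixed seed so that the run is repeatable
estimate = standardized_sum_cdf(n=30, a=1.0, trials=20_000, rng=rng)
print("simulated:", estimate, " P(Z <= 1):", std_normal_cdf(1.0))
```

The two printed numbers should agree to about two decimal places; the remaining difference is mostly sampling error from using finitely many trials.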
18.9 Checking the Central Limit Theorem for
another binomial distribution 


Recall that we started our discussion of the Central Limit Theorem by noting 
the shape of the graphs in Figures and These graphs are 
only a tiny example, since they only deal with binomials for which p = .5. 
Theorem of course applies to every binomial distribution, and to an 
enormous zoo of other distributions. 

Although we won’t do much testing of the Central Limit Theorem, we 
can at least try another binomial distribution. Let’s graph some binomial 
distributions for p = .99. This value of p favors success in an extreme way, 
so it certainly changes the shape of the binomial distributions. The Central 
Limit Theorem asserts that, with proper scaling, the binomial distributions 
will still become approximately normal as n grows large. 

Recall the interval $[\ell(n), u(n)]$ defined in Terminology 18.1, which contains
most of the probability in the distribution. With p = .99, we compute that
$np = 99$ and $[\ell(n), u(n)] = [95, 100]$ when n = 100, $np = 990$ and
$[\ell(n), u(n)] = [979, 998]$ when n = 1000, and $np = 9900$ and
$[\ell(n), u(n)] = [9868, 9929]$ when n = 10000.

Graphs showing the main parts of the binomial distribution, for p = .99
and n = 100, 1000, 10000, are given in Figures 18.13, 18.14 and 18.15.

Things look bad when n = 100! However, we can see that the shape of 
the graph seems to be somewhat similar to the shape of a normal density 
when n = 1000, and is more similar when n = 10000, in rough agreement 
with the Central Limit Theorem. 
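
The visual comparison can be backed up by a small computation. The following sketch compares $P(S_n = k)$ with the normal density having the same mean and variance. The measure of disagreement used here (largest pointwise difference, divided by the largest value of $P(S_n = k)$) is just one convenient choice made for this illustration, and the function names are hypothetical.

```python
import math

def binomial_pmf(n, k, p):
    """P(S_n = k) for a binomial(n, p) sum, computed through logarithms to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def normal_density(x, mean, variance):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def relative_gap(n, p):
    """Largest |P(S_n = k) - normal density at k|, divided by the largest P(S_n = k)."""
    mean, variance = n * p, n * p * (1.0 - p)
    pmf = [binomial_pmf(n, k, p) for k in range(n + 1)]
    worst = max(abs(pmf[k] - normal_density(k, mean, variance)) for k in range(n + 1))
    return worst / max(pmf)

for n in (100, 1000, 10000):
    print(n, relative_gap(n, 0.99))
```

The printed gap shrinks as n grows, in line with the impression given by the three figures.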
Figure 18.13: Main values of $P(S_n = k)$ for n = 100, p = .99.


Figure 18.14: Main values of $P(S_n = k)$ for n = 1000, p = .99.


Figure 18.15: Main values of $P(S_n = k)$ for n = 10000, p = .99.
18.10 Sums of independent normals


This section assumes knowledge of the convolution operation for functions,
defined in Appendix J.


Definition 18.30 (The mean zero normal densities). In equation (18.23)
we defined the density $\eta$, the standard normal density with mean zero and
variance one.

For any $t > 0$, define $\eta_t$ by

\[ \eta_t(x) = \frac{1}{\sqrt{2\pi t}}\, e^{-\frac{x^2}{2t}}. \tag{18.48} \]

By Corollary 18.6, $\eta_t$ is a density for a random variable with mean zero and
variance $t$.
Of course, $\eta_1 = \eta$.


Lemma 18.31. Let $Z_a$ be a normal random variable with mean zero and
variance $a$, and let $Z_b$ be a normal random variable with mean zero and
variance $b$. Suppose that $Z_a$ and $Z_b$ are independent.

Then $Z_a + Z_b$ is normal, with mean zero and variance $a + b$. Also

\[ \eta_a * \eta_b = \eta_{a+b}. \tag{18.49} \]

Here $\eta_a * \eta_b$ is the convolution of the functions $\eta_a$, $\eta_b$.


Proof. As noted in Definition 18.30, $\eta_a$ is a density for $Z_a$ and $\eta_b$ is a density
for $Z_b$.

Therefore $\eta_a * \eta_b$ is a density for $Z_a + Z_b$, by Section J.5.

We will perform a calculation to show that $\eta_a * \eta_b$ is a normal density.
This will show that $Z_a + Z_b$ is normal, and the rest will follow.

By equation (J.13),

\[ \eta_a * \eta_b(z) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi a}}\, e^{-\frac{t^2}{2a}}\, \frac{1}{\sqrt{2\pi b}}\, e^{-\frac{(z-t)^2}{2b}}\, dt. \]

Thus

\[ \eta_a * \eta_b(z) = \frac{1}{\sqrt{2\pi a}} \frac{1}{\sqrt{2\pi b}} \int_{-\infty}^{\infty} e^{-\frac{t^2}{2a} - \frac{(z-t)^2}{2b}}\, dt \tag{18.50} \]
\[ \phantom{\eta_a * \eta_b(z)} = \frac{1}{\sqrt{2\pi a}} \frac{1}{\sqrt{2\pi b}} \int_{-\infty}^{\infty} e^{-\psi}\, dt, \]

where

\[ \psi = \frac{t^2}{2a} + \frac{(z-t)^2}{2b} = \frac{t^2}{2a} + \frac{1}{2b}\left(z^2 - 2zt + t^2\right). \tag{18.51} \]

Thus

\[ \psi = t^2\left(\frac{1}{2a} + \frac{1}{2b}\right) - \frac{zt}{b} + \text{stuff}, \]

where the “stuff” does not involve $t$, and is a quadratic polynomial in $z$.
After completing the square for the variable $t$, we see that

\[ \psi = \alpha_1 (t - \alpha_2 z)^2 + \text{stuff}, \]

where $\alpha_1, \alpha_2$ are constants, $\alpha_1 > 0$ and the “stuff” does not involve $t$, and is
a quadratic polynomial in $z$. Thus

\[ \int_{-\infty}^{\infty} e^{-\psi}\, dt = e^{-\text{stuff}} \int_{-\infty}^{\infty} e^{-\alpha_1 (t - \alpha_2 z)^2}\, dt. \]

Letting $s = t - \alpha_2 z$, we can see that

\[ \int_{-\infty}^{\infty} e^{-\alpha_1 (t - \alpha_2 z)^2}\, dt = \int_{-\infty}^{\infty} e^{-\alpha_1 s^2}\, ds = \alpha_3, \]

where $\alpha_3$ is some constant which does not depend on $z$.

So we know that

\[ \eta_a * \eta_b(z) = d\, e^{-k_2 z^2 - k_1 z - k_0} \]

for some constants $k_2, k_1, k_0, d$. We also know that the convolution of
probability densities is again a probability density, by Section J.5. And so $d$ cannot
be zero, since the integral of a density must be equal to one.

If $k_2$ were not a positive number, $\int_{-\infty}^{\infty} d\, e^{-k_2 z^2 - k_1 z - k_0}\, dz$ could not exist.
But it does, and so we conclude that $k_2 > 0$.

By Lemma 18.8, $d\, e^{-k_2 z^2 - k_1 z - k_0}$ is a normal density, i.e. $\eta_a * \eta_b$ is a normal
density.

Hence $Z_a + Z_b$ is a normal random variable.

Since $Z_a$ and $Z_b$ are mean zero random variables, $Z_a + Z_b$ has mean zero.

Also $\mathrm{Var}(Z_a + Z_b) = \mathrm{Var}(Z_a) + \mathrm{Var}(Z_b) = a + b$. Since we know that
the mean and the variance of $Z_a + Z_b$ match the mean and variance of $\eta_{a+b}$,
we know that the distribution of $Z_a + Z_b$ has density $\eta_{a+b}$.
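
Equation (18.49) can also be checked numerically. The sketch below approximates the convolution integral by the midpoint rule and compares it with $\eta_{a+b}$. The choices $a = 2$, $b = 3$, the integration range and the step count are assumptions made only for this illustration, and the function names are hypothetical.

```python
import math

def eta(t, x):
    """Mean zero normal density with variance t, as in equation (18.48)."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)

def convolve_at(a, b, z, half_width=40.0, steps=40_000):
    """Midpoint-rule approximation of the integral of eta_a(t) * eta_b(z - t) dt.

    The integral is truncated to [-half_width, half_width]; the discarded
    tails are negligible for the variances used below.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        t = -half_width + (i + 0.5) * h
        total += eta(a, t) * eta(b, z - t)
    return total * h

a, b = 2.0, 3.0
for z in (0.0, 1.0, -2.5):
    print(z, convolve_at(a, b, z), eta(a + b, z))
```

For each $z$ the two printed values should agree to many decimal places, as the lemma predicts.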


Lemma 18.31 implies:


Theorem 18.32 (A sum of independent normals is normal). Let X
and Y be normal random variables. If X and Y are independent then X + Y
is normal.


Proof. Lemma 18.31 takes care of the mean zero case.

Thus $X - E[X] + Y - E[Y]$ is normal.

The random variable $X + Y$ is obtained from $X - E[X] + Y - E[Y]$ by
shifting (adding $E[X] + E[Y]$). Since a shifted normal distribution is normal,
$X + Y$ is normal.


Why should we have expected that Theorem |18.32] holds? 


Much as in the case of the Poisson distribution (Lemma 17.6), our picture of
the normal distribution as an approximation to a sum of independent random
variables suggests that X + Y must be normal, even without a proof. One
can use the following argument.

We can think of X as statistically similar to a sum of many small
independent random variables $U_1,\dots,U_n$.

Similarly we can think of Y as statistically similar to a sum of small
independent random variables $V_1,\dots,V_m$.

We can imagine that the whole sequence $U_1,\dots,U_n,V_1,\dots,V_m$ is
independent. This is because we can think of measuring values of X for repetitions
of an experiment, and measuring values for Y for repetitions of a completely
different experiment. The physical motivation for the Central Limit Theorem
tells us that the distribution of $U_1 + \dots + U_n + V_1 + \dots + V_m$ should be
approximately a normal distribution. Thus the statistics for X + Y should
be those of a normal distribution.
18.11 Solutions for Chapter 18


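Solution (Exercise 18.1).


(i)
\[ E[Y^2] = \int_{-\infty}^{\infty} y^2\, \frac{1}{\sqrt{\pi}}\, e^{-y^2}\, dy = \frac{2}{\sqrt{\pi}} \int_{0}^{\infty} y^2 e^{-y^2}\, dy, \]

provided that the integral exists. We show the integral exists and evaluate
it at the same time:

\[ \int_{0}^{\infty} y^2 e^{-y^2}\, dy = \lim_{b\to\infty} \int_{0}^{b} y^2 e^{-y^2}\, dy = \frac{1}{2} \lim_{b\to\infty} \int_{0}^{b} y \cdot 2y e^{-y^2}\, dy \]
\[ = \lim_{b\to\infty} \frac{1}{2} \left( \Bigl[ -y e^{-y^2} \Bigr]_{0}^{b} + \int_{0}^{b} e^{-y^2}\, dy \right) \]
\[ = 0 + \frac{1}{2} \int_{0}^{\infty} e^{-y^2}\, dy = \frac{1}{4} \int_{-\infty}^{\infty} e^{-y^2}\, dy = \frac{\sqrt{\pi}}{4}. \]

We used integration by parts to obtain the second line of the equation.
Combining our facts, $E[Y^2] = \frac{2}{\sqrt{\pi}} \cdot \frac{\sqrt{\pi}}{4} = \frac{1}{2}$. Since $E[Y] = 0$ by part (ii),
$\mathrm{Var}(Y) = \frac{1}{2}$.


(ii) By equation (15.5),

\[ E[Y] = \int_{-\infty}^{\infty} y\, \frac{1}{\sqrt{\pi}}\, e^{-y^2}\, dy = \int_{-\infty}^{0} y\, \frac{1}{\sqrt{\pi}}\, e^{-y^2}\, dy + \int_{0}^{\infty} y\, \frac{1}{\sqrt{\pi}}\, e^{-y^2}\, dy. \]

A trivial change of variable shows that

\[ \int_{-\infty}^{0} y\, \frac{1}{\sqrt{\pi}}\, e^{-y^2}\, dy = - \int_{0}^{\infty} y\, \frac{1}{\sqrt{\pi}}\, e^{-y^2}\, dy, \]

so equation (18.5) holds.


(iii)
By equation (15.5),

\[ E[|Y|] = \int_{-\infty}^{\infty} |y|\, \frac{1}{\sqrt{\pi}}\, e^{-y^2}\, dy = \frac{2}{\sqrt{\pi}} \int_{0}^{\infty} y e^{-y^2}\, dy = \frac{1}{\sqrt{\pi}} \int_{0}^{\infty} 2y e^{-y^2}\, dy = \frac{1}{\sqrt{\pi}}. \]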
Solution (Exercise 18.2).


(i) For an interval $[a, b]$,

\[ P(W \in [a,b]) = P(X + m \in [a,b]) = P(a \le X + m \le b) \]
\[ = P(a - m \le X \le b - m) = P(X \in [a-m, b-m]) \]
\[ = \int_{a-m}^{b-m} f(x)\, dx = \int_{a}^{b} f(w - m)\, dw. \]

By Remark 9.11, $f(w - m)$ is a probability density for the distribution of W.


(ii) Suppose that $c > 0$. For an interval $[a, b]$,

\[ P(V \in [a,b]) = P(cX \in [a,b]) = P(a \le cX \le b) = P(a/c \le X \le b/c) \]
\[ = \int_{a/c}^{b/c} f(x)\, dx = \frac{1}{c} \int_{a}^{b} f(v/c)\, dv. \]

By Remark 9.11, $(1/c) f(v/c)$ is a probability density for the distribution of
V.
Suppose that $c < 0$. Then $|c| = -c$. For an interval $[a, b]$,

\[ P(V \in [a,b]) = P(a \le cX \le b) = P(b/c \le X \le a/c) \]
\[ = \int_{b/c}^{a/c} f(x)\, dx = \frac{1}{|c|} \int_{a}^{b} f(v/c)\, dv. \]

By Remark 9.11, $(1/|c|) f(v/c)$ is a probability density for the distribution
of V.

(iii) Let $g(v) = (1/|c|) f(v/c)$.

By part (ii), $g$ is a probability density for the distribution of $cX$.

Then by part (i), $g(u - m)$ is a probability density for the distribution
of $cX + m$.

And $g(u - m) = (1/|c|) f((u - m)/c)$.
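Solution (Exercise 18.3).

\[ \mathrm{Var}\!\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} \mathrm{Var}(X - \mu) = \frac{1}{\sigma^2} \mathrm{Var}(X) = \frac{1}{\sigma^2}\, \sigma^2 = 1, \]

\[ E\!\left[\frac{X - \mu}{\sigma}\right] = \frac{1}{\sigma} E[X - \mu] = \frac{1}{\sigma} (\mu - \mu) = 0. \]


Solution (Exercise 18.4). Let X be standard normal and let n be an odd
nonnegative integer. We must show that $E[X^n] = 0$.
By equation (15.5),

\[ E[X^n] = \int_{-\infty}^{\infty} x^n \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = \int_{-\infty}^{0} x^n \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx + \int_{0}^{\infty} x^n \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx. \]

A trivial change of variable shows that

\[ \int_{-\infty}^{0} x^n \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = - \int_{0}^{\infty} x^n \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx, \]

so $E[X^n] = 0$.

Solution (Exercise 18.5).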


(i) By Exercise Z is standard normal. Hence a density for the distri- 
bution of Z is 7, where 


(ii)
\[ P(|X - \mu| \ge c\sigma) = P\!\left( \left| \frac{X - \mu}{\sigma} \right| \ge c \right) = P(|Z| \ge c). \]


Using the symmetry of the density for Z, 


22 2 
Sincee 2 < ze 2 


Cc 
(iii) Equation (16.24) says that

\[ P(|X - \mu| \ge c\sigma) \le \frac{1}{c^2}. \]


So we are asking which is a better bound: 


As co, 


(and does so very rapidly), so Y=e~ = is a much sharper bound when c is 
large. 


Solution (Exercise 18.6).

\[ E[X_i] = -\frac{2}{3} + \frac{2}{6} + \frac{5}{6} = \frac{1}{2}, \]

\[ E[X_i^2] = \frac{2}{3} + \frac{4}{6} + \frac{25}{6} = \frac{33}{6} = \frac{11}{2}, \]

\[ \mathrm{Var}(X_i) = \frac{11}{2} - \left(\frac{1}{2}\right)^2 = \frac{21}{4}. \]

Each $X_i$ has standard deviation $\sigma$ given by

\[ \sigma = \frac{\sqrt{21}}{2}. \]

We will try to find an approximation for each $S_n$.
By additivity, $E[S_n] = \frac{n}{2}$.

By equation (16.35), $\mathrm{Var}(S_n) = \frac{21n}{4}$.
Let $W_n$ have a normal distribution, with $\mathrm{Var}(W_n) = \frac{21n}{4}$ and $E[W_n] = \frac{n}{2}$.
Because of the Central Limit Theorem, we know that for large $n$ we have

\[ P(S_n \le 5200) \approx P(W_n \le 5200). \]

We hope that n = 10000 will give a reasonable approximation.
Let $h$ be the probability density for the distribution of $W_n$, with n = 10000.
Then

\[ P(W_n \le 5200) = \int_{-\infty}^{5200} h. \tag{18.52} \]

To finish the problem, we need to know the formula for $h$.
Let $m = E[W_n] = E[S_n] = 5000$, and let $v = \mathrm{Var}(W_n) = \mathrm{Var}(S_n) = 2500 \cdot 21$.
By Lemma 18.20,

\[ h(u) = \frac{1}{\sqrt{2\pi v}}\, e^{-(u - m)^2/(2v)}. \]

This completes the solution. To obtain the actual numerical value of
$P(W_n \le 5200)$ one can use a computer program to perform the numerical
integration. This gives

\[ \int_{-\infty}^{5200} h(u)\, du = \int_{-\infty}^{5200} \frac{1}{\sqrt{2\pi \cdot 2500 \cdot 21}}\, e^{-(u - 5000)^2/(2 \cdot 2500 \cdot 21)}\, du \approx 0.8086334555573861. \]
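
As a cross-check on the numerical value, the integral in equation (18.52) can be evaluated without numerical integration, since the CDF of a normal distribution can be written with the error function. This is only a sketch; the helper name `normal_cdf` is hypothetical, and the use of `math.erf` is a choice made here, not the method used for the value quoted above.

```python
import math

def normal_cdf(x, mean, variance):
    """P(W <= x) for a normal W with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))

n = 10_000
m = n / 2        # E[S_n] = n/2 = 5000
v = 21 * n / 4   # Var(S_n) = 21n/4 = 2500 * 21
approx = normal_cdf(5200, m, v)
print(approx)
```

The printed value should agree with the one quoted in the solution to several decimal places.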
Solution (Exercise 18.7). Since

\[ \{X < a\} \cup \{a \le X \le b\} = \{X \le b\}, \]

and the union is disjoint,

\[ P(X < a) + P(a \le X \le b) = P(X \le b). \]

Equation (18.37) follows.


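Solution (Exercise 18.8). Note that $P(S_n \le a_n) = F_{S_n}(a_n)$. By
Theorem 18.26,

\[ \lim_{n\to\infty} \left( F_{S_n}(a_n) - F_Z\!\left(\frac{a_n - n\mu}{\sqrt{n}\,\sigma}\right) \right) = 0. \]

In this problem $\mu = 0$, so

\[ \lim_{n\to\infty} \left( F_{S_n}(a_n) - F_Z\!\left(\frac{a_n}{\sqrt{n}\,\sigma}\right) \right) = 0. \tag{18.53} \]

Also, since

\[ F_Z(a) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-z^2/2}\, dz, \]

$F_Z$ is a continuous function.


(i) When $a_n \to 0$, $\frac{a_n}{\sqrt{n}\,\sigma} \to 0$. Hence

\[ F_Z\!\left(\frac{a_n}{\sqrt{n}\,\sigma}\right) \to F_Z(0) = \frac{1}{2}. \]

Thus

\[ F_{S_n}(a_n) \to \frac{1}{2}. \]

(ii) When $a_n \to 5$, $\frac{a_n}{\sqrt{n}\,\sigma} \to 0$. Thus as in part (i),

\[ F_{S_n}(a_n) \to \frac{1}{2}. \]

(iii) When $a_n = 2\sqrt{n}$, $\frac{a_n}{\sqrt{n}\,\sigma} = \frac{2}{\sigma}$. Hence

\[ F_{S_n}(a_n) \to F_Z\!\left(\frac{2}{\sigma}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{2/\sigma} e^{-z^2/2}\, dz. \]

(iv) The same argument used in part (iii) shows that

\[ F_{S_n}(b_n) \to F_Z\!\left(\frac{10}{\sigma}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{10/\sigma} e^{-z^2/2}\, dz. \]

Similarly

\[ G_{S_n}(a_n) \to G_Z\!\left(\frac{2}{\sigma}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{2/\sigma} e^{-z^2/2}\, dz. \]

Thus

\[ P(S_n \in [a_n, b_n]) = F_{S_n}(b_n) - G_{S_n}(a_n) \to F_Z\!\left(\frac{10}{\sigma}\right) - G_Z\!\left(\frac{2}{\sigma}\right) = \frac{1}{\sqrt{2\pi}} \int_{2/\sigma}^{10/\sigma} e^{-z^2/2}\, dz. \]
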
Solution (Exercise 18.9).


Zz 
Sincee 2 < 


Solution (Exercise 18.10). Define

\[ a_n = \sqrt{n}\,\sigma a + n\mu. \]


By equation (18.42) (which was an immediate consequence of Theo- 


But for each n, 


Thus 
APPENDICES


These appendices contain additional details about topics discussed earlier.
Appendices I, J, K and L also introduce subjects that are not covered in this
book, but which readers may encounter later.
Appendix A


Some practice with averages


Averages occur throughout probability theory. This section reviews the
general concepts (Definition A.1, Definition A.2) and gives some exercises to
illustrate the properties of averages. Readers might benefit by sampling the
exercises and testing the statements against their own intuitions.


Definition A.1 (Linear combinations). Suppose that numbers $v_1,\dots,v_n$
and $a_1,\dots,a_n$ are given. The expression

\[ a_1 v_1 + \dots + a_n v_n \tag{A.1} \]

is said to be a linear combination of $v_1,\dots,v_n$, using coefficients $a_1,\dots,a_n$.


In this section, we consider the special case in which the coefficients
$a_1,\dots,a_n$ in equation (A.1) are nonnegative. Nonnegative coefficients will
be referred to here as weights.


Definition A.2 (Weighted sums and averages). Let $v_1,\dots,v_n$ be numbers
which we will call the values, and let $w_1,\dots,w_n$ be nonnegative numbers
which we will call the weights. The “weighted sum” of the values $v_i$, using
the weights $w_i$, is

\[ \sum_{i=1}^{n} w_i v_i. \tag{A.2} \]

When the numbers $w_i$ add up to one, we say that the weights are normalized,
and in this case the weighted sum in equation (A.2) is called the “weighted
average” of the values $v_i$.

Any average is also called a mean. A weighted average in which all the
weights are equal is called the arithmetic mean. Since the weights must add
to one, in this case each weight $w_i$ must be equal to $1/n$, where $n$ is the
number of weights. Thus the arithmetic mean can be calculated by dividing
the sum of the values by the number of values. This is what is usually meant
by the word “average” in ordinary speech!


For brevity, we sometimes use an overline to denote an average value.
Thus given some values $v_1,\dots,v_n$ and weights $w_1,\dots,w_n$, the weighted
average of $v_1,\dots,v_n$ might be denoted by $\bar{v}$.

By definition, weighted averages always use normalized weights, but it will 
be convenient to extend the terminology about weighted averages slightly. 


Suppose we are given weights $w_1,\dots,w_n$ which are not normalized. Let
$W = w_1 + \dots + w_n$. Then $w_1/W,\dots,w_n/W$ are normalized weights which are
proportional to $w_1,\dots,w_n$. To save words, if we are given values $v_1,\dots,v_n$
and unnormalized weights $w_1,\dots,w_n$, we will say that

\[ \bar{v} = \frac{w_1}{W} v_1 + \dots + \frac{w_n}{W} v_n \tag{A.3} \]

is “the weighted average with weights $w_1,\dots,w_n$”.


Thus, in a problem, if a weighted average is requested, and the given 
weights are not normalized, it is understood that the weights should be nor- 
malized before calculating the average. 


Exercise A.1. A sequence x = (21,...,25) of values and a sequence w = 
(w1,...,Ws) of weights is given, such that 7; = y= ; eS j= 
t= 5 and wy =F, w= 


O IF 


= 2 w3=3,ws=2,w5 =2. Find the weighted 


average of the values in the sequence x. 


Example A.3 (Mean value equals center of mass). Anyone who has 
seen a center of mass calculation in a physics course has encountered a 
weighted average. 

Suppose we have seven point masses on a line. The position coordinates 
of the masses are $v_1 = -1.0$, $v_2 = -.4$, $v_3 = .1$, $v_4 = 1.5$, $v_5 = 2.2$, $v_6 = 3.2$,
$v_7 = 3.7$, while the weights of the masses are $w_1 = 1$, $w_2 = 3$, $w_3 = 3$, $w_4 = 2$,
$w_5 = 1$, $w_6 = 1$, $w_7 = 3$.

The sum of weights is $W = 14$, and the center of mass coordinate is
defined to be the usual weighted average $\bar{v}$, which is given by

\[ \bar{v} = \frac{w_1}{W} v_1 + \frac{w_2}{W} v_2 + \frac{w_3}{W} v_3 + \frac{w_4}{W} v_4 + \frac{w_5}{W} v_5 + \frac{w_6}{W} v_6 + \frac{w_7}{W} v_7 = \frac{17.6}{14} \approx 1.25714. \tag{A.4} \]

See Figure A.1. We can think that each weight $w_i$ is attached to a rigid
bar at $v_i$. If the bar is free to turn about a pivot located at $\bar{v}$, it will balance,
and remain stationary.
When X is a random variable with values $v_1,\dots,v_7$, and $w_i = P(X = v_i)$,
then $W = w_1 + \dots + w_7 = 1$. Then equation (A.4) says that

\[ \bar{v} = w_1 v_1 + \dots + w_7 v_7, \]

and this sum is $E[X]$, by definition. In the same way, for any finite-range
random variable X, $E[X]$ is equal to the center of mass of the distribution
of X, provided that we represent the distribution by putting a lump of
probability mass equal to $P(X = v_i)$ at each point $v_i$.
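
The arithmetic in Example A.3 is easy to reproduce. The following sketch computes the weighted average from the positions and weights listed in the example, and also checks the balancing property by computing the total moment about $\bar{v}$. The function name `weighted_average` is hypothetical.

```python
def weighted_average(values, weights):
    """Weighted average of the values; the weights need not be normalized."""
    total_weight = sum(weights)
    return sum(w * v for v, w in zip(values, weights)) / total_weight

positions = [-1.0, -0.4, 0.1, 1.5, 2.2, 3.2, 3.7]  # v_1, ..., v_7
masses = [1, 3, 3, 2, 1, 1, 3]                     # w_1, ..., w_7

center = weighted_average(positions, masses)
# Total moment about the center of mass; it vanishes up to rounding error.
moment = sum(w * (v - center) for v, w in zip(positions, masses))
print(center, moment)
```

The vanishing moment is the numerical counterpart of the statement that the bar balances on a pivot placed at $\bar{v}$.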


Exercise A.2 (Average of a constant). Let $v_1,\dots,v_n$ be values, such
that each $v_i$ is equal to the same number $c$. Let $w_1,\dots,w_n$ be any sequence
of weights. Show that the weighted average of $v_1,\dots,v_n$ is equal to $c$.


This exercise is not hard, but it is a useful observation. 


Another simple but useful property of averages is the following. 


Lemma A.4 (Scaling and bounds for averages). If all the values in a
sequence are multiplied by the same number $c$, then any weighted average is
also multiplied by $c$.

Any weighted average of a sequence of values always lies between the
smallest value and the largest value. That is, if $v_1,\dots,v_n$ are the values and
$w_1,\dots,w_n$ are the weights, then

\[ \min(v_1,\dots,v_n) \le \bar{v} \le \max(v_1,\dots,v_n). \tag{A.5} \]


Figure A.1: $\bar{v}$ is the center of mass for the seven masses


Proof. Let $W = w_1 + \dots + w_n$. The first fact in the statement of the lemma
just says that

\[ \frac{1}{W} \sum_{i} w_i c v_i = c\, \frac{1}{W} \sum_{i} w_i v_i. \]

To derive the second fact, let $m = \min(v_1,\dots,v_n)$ and let $M = \max(v_1,\dots,v_n)$.
Then

\[ \bar{v} = \frac{1}{W} \sum_{i} w_i v_i \le \frac{1}{W} \sum_{i} w_i M = M. \]

Similarly $\bar{v} \ge m$.


If it happens that all the values $v_1,\dots,v_n$ are equal to the same number $c$,
then the maximum and minimum of the sequence are both equal to $c$. Thus
Lemma A.4 implies the result of Exercise A.2.
Exercise A.3 (Replacing values in a weighted sum by a constant).
Let $v_1,\dots,v_n$ be real numbers and let $w_1,\dots,w_n$ be weights. Let

\[ M = w_1 v_1 + \dots + w_n v_n, \]

and let

\[ \bar{v} = \frac{w_1}{W} v_1 + \dots + \frac{w_n}{W} v_n, \]

where $W$ is the sum of the weights.
Show that

\[ M = w_1 \bar{v} + \dots + w_n \bar{v}. \tag{A.6} \]

Using the terminology of weighted sums and averages, this exercise tells us
another useful fact:

“The value of a weighted sum is unchanged if all the values are replaced
by their weighted average.”

Note carefully that equation (A.6) holds even if the weights $w_1,\dots,w_n$
are not normalized, but the weighted average $\bar{v}$ is of course always calculated
using normalized weights obtained from $w_1,\dots,w_n$.


Exercise A.4 (Replacing some of the values by the average of those
values). Let $v_1,\dots,v_{m+n}$ be real numbers and let $w_1,\dots,w_{m+n}$ be weights.
Let $s$ be the weighted sum of $v_1,\dots,v_{m+n}$, using the weights $w_1,\dots,w_{m+n}$.
Let $z$ be the weighted average of $v_1,\dots,v_m$, using the weights $w_1,\dots,w_m$.

Note carefully that the definition of $z$ does not involve any of the values other
than $v_1,\dots,v_m$.

Let $x_i = z$ for $i = 1,\dots,m$, and $x_i = v_i$ for $i = m+1,\dots,m+n$.

Prove that the weighted sum of $x_1,\dots,x_{m+n}$ is equal to $s$.


Exercise A.5 (Replacing some values by the wrong average). In the
setting of Exercise A.3, where we are given values $v_1,\dots,v_n$ and weights
$w_1,\dots,w_n$, again let the weighted average be $\bar{v}$. Suppose we replace $v_1$ by $\bar{v}$
but leave $v_2,\dots,v_n$ unchanged. Show by an example that the value of the
weighted sum may change as a result of this substitution.
Exercise A.6 (Averaging pooled data). This problem contains some
important techniques if you have to work with averages.
A sequence $v$ of data has length 4700. Denote the arithmetic mean of $v$
by $\bar{v}$.
Let

\[ x = v_1, \dots, v_{1500}; \]
\[ y = v_{1501}, \dots, v_{3000}; \]
\[ z = v_{3001}, \dots, v_{4700}. \]

We might refer to $x, y, z$ as “blocks of data”.

Denote the arithmetic means of $x$, $y$, and $z$ by $\bar{x}$, $\bar{y}$ and $\bar{z}$ respectively.

Suppose you are given $\bar{x}$, $\bar{y}$ and $\bar{z}$.

Derive a formula for $\bar{v}$ in terms of $\bar{x}$, $\bar{y}$ and $\bar{z}$.

One sometimes says that the data sequence v is obtained by “pooling” 
the data in sequences x, y and z. The answer to this problem says that the 
“pooled average” of the data can be calculated as a weighted average of the 
averages of the three blocks of data x, y, z. 

This illustrates a small theorem, which you are finding in this exercise. 
It comes with a slogan: 


“The average of the averages is the average.” 


Exercise A.7 (A weighted average for pooled data). Consider the
same data sequences $v, x, y, z$ studied in Exercise A.6. The rule given in
that exercise generalizes to weighted averages, as you will now show.

In the present situation suppose that you wish to find the weighted average
of $v$, using weights $w_1,\dots,w_{4700}$. You are not told the weights $w_i$, but you
are told numbers $r, s, t$, where $r = w_1 + \dots + w_{1500}$, $s = w_{1501} + \dots + w_{3000}$,
and $t = w_{3001} + \dots + w_{4700}$.

Using the weights $w_i$, let $\bar{v}$ be the weighted average of $v$, and let $\bar{x}$, $\bar{y}$ and
$\bar{z}$ be the weighted averages of $x$, $y$, and $z$, respectively.
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answer. 


Exercise A.8 (Weighted sum of a sum of sequences). Let $w_1,\dots,w_n$
be a sequence of weights. Let $x_1,\dots,x_n$ be a sequence of values, and let
$y_1,\dots,y_n$ be a sequence of values.
Let $s$ be the weighted sum of $x_1,\dots,x_n$, using the weights $w_1,\dots,w_n$.
Let $t$ be the weighted sum of $y_1,\dots,y_n$, using the same weights $w_1,\dots,w_n$.
Prove that the weighted sum of $x_1 + y_1,\dots,x_n + y_n$, using those weights
$w_1,\dots,w_n$, is equal to $s + t$.


Of course, since a weighted average is simply a weighted sum using
normalized weights, you have also shown the following. If $\bar{x}$ is the weighted
average of the $x_i$ and $\bar{y}$ is the weighted average of the $y_i$, then the weighted
average of the numbers $x_i + y_i$ is equal to $\bar{x} + \bar{y}$.

One might write this rule as $\overline{x + y} = \bar{x} + \bar{y}$.

And one might say: “The average of a sum is the sum of the averages.” 


Remark A.5 (A danger when comparing averages). Here’s an example 
of a problem. We’ll phrase it in terms of batting averages in baseball, but it 
can come up in other situations, for example in testing medical procedures. 

A baseball player’s batting average, over a given time period, is defined 
as the fraction of the times at bat which actually result in a hit. 

Suppose you are comparing two players, named A and B. You don’t know 
the details of their records, but you feel, quite reasonably, that their batting 
averages will give you a good idea of their abilities as hitters. 

But suppose you don’t learn their batting averages over the entire season, 
but instead you are given their averages over the first half of the season, and 
then separately are given their averages over the second half of the season. 

Suppose that player A has a better batting average than B over the first 
half of the season, and also has a better average than B over the second half 
of the season. 

Can we conclude that A has a better average than B over the whole 
season? One might jump to that conclusion after reading Exercise A.6, which
gives a formula for combining averages from different sets of data. However,
in this case, there is an important extra piece of information: different players 
may have different numbers of times at bat. 

Suppose, for example, that neither player A nor player B did particularly 
well during the first half of the season, and both had very similar records, with 
A slightly better. And suppose that player A had a great batting average 
during the second half of the season, but had very few times at bat during 
that period. In the second half of the season, player B had a batting average 
that was very good, and had a large number of times at bat. 

When combining the averages for these players, player B’s very good 
average during the second half of the season will correctly receive much more 
weight than player A’s great average. As a result, player B may have the 
best overall average. 
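
To make the danger concrete, here is a small computation with made-up records. All the numbers are hypothetical and were chosen only to fit the story told in Remark A.5: player A is ahead in each half of the season, yet player B has the better average over the whole season.

```python
from fractions import Fraction

# Hypothetical (hits, times at bat) records: first half, then second half.
records = {
    "A": [(30, 100), (4, 10)],
    "B": [(29, 100), (78, 200)],
}

def half_averages(halves):
    """Batting average within each half of the season."""
    return [Fraction(hits, at_bats) for hits, at_bats in halves]

def season_average(halves):
    """Batting average over the whole season: total hits over total times at bat."""
    total_hits = sum(hits for hits, _ in halves)
    total_at_bats = sum(at_bats for _, at_bats in halves)
    return Fraction(total_hits, total_at_bats)

for player, halves in records.items():
    first, second = half_averages(halves)
    print(player, float(first), float(second), float(season_average(halves)))
```

The season average is a weighted average of the two half-season averages, with weights given by the times at bat. Player A's second-half average receives weight 10/110, while player B's receives weight 200/300, which is why the comparison reverses.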


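A.1 Solutions for Appendix A


Solution (Exercise A.1). Let

\[ W = w_1 + w_2 + w_3 + w_4 + w_5. \]

Then

\[ \bar{x} = (x_1 w_1 + x_2 w_2 + x_3 w_3 + x_4 w_4 + x_5 w_5)/W \approx 0.641509433962264. \]

Solution (Exercise A.2). Let

\[ W = w_1 + \dots + w_n. \]

Then

\[ \frac{w_1 v_1 + \dots + w_n v_n}{W} = \frac{w_1 c + \dots + w_n c}{W} = \frac{W c}{W} = c. \]


Solution (Exercise A.3). By definition,

\[ \bar{v} = \frac{w_1}{W} v_1 + \dots + \frac{w_n}{W} v_n = \frac{1}{W} (w_1 v_1 + \dots + w_n v_n) = \frac{M}{W}. \]

Thus $W \bar{v} = M$.
Hence

\[ w_1 \bar{v} + \dots + w_n \bar{v} = W \bar{v} = M. \]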


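Solution (Exercise A.4). Let $K = w_1 + \dots + w_m$. Then

\[ z = \frac{w_1 v_1 + \dots + w_m v_m}{K}, \]

so $Kz = w_1 v_1 + \dots + w_m v_m$.
Then

\[ w_1 x_1 + \dots + w_{m+n} x_{m+n} = (w_1 z + \dots + w_m z) + (w_{m+1} v_{m+1} + \dots + w_{m+n} v_{m+n}) \]
\[ = Kz + (w_{m+1} v_{m+1} + \dots + w_{m+n} v_{m+n}) \]
\[ = (w_1 v_1 + \dots + w_m v_m) + (w_{m+1} v_{m+1} + \dots + w_{m+n} v_{m+n}) = s. \]


Solution (Exercise A.5). Let $w_1 = w_2 = 1/2$, and let $v_1 = 1$, $v_2 = -1$.

Then $\bar{v} = 0$ and $w_1 v_1 + w_2 v_2 = 0$.
Replacing $v_1$ by $\bar{v}$, the weighted sum becomes

\[ w_1 \bar{v} + w_2 v_2 = -\frac{1}{2}. \]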
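Solution (Exercise A.6).

\[ \bar{x} = \frac{v_1 + \dots + v_{1500}}{1500}, \qquad \bar{y} = \frac{v_{1501} + \dots + v_{3000}}{1500}, \qquad \bar{z} = \frac{v_{3001} + \dots + v_{4700}}{1700}. \]

Then

\[ \bar{v} = \frac{v_1 + \dots + v_{4700}}{4700} = \frac{1500 \bar{x} + 1500 \bar{y} + 1700 \bar{z}}{4700} = \frac{1500}{4700} \bar{x} + \frac{1500}{4700} \bar{y} + \frac{1700}{4700} \bar{z}. \]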


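Solution (Exercise A.7). By definition,

\[ \bar{x} = \frac{w_1 v_1 + \dots + w_{1500} v_{1500}}{r}, \]
\[ \bar{y} = \frac{w_{1501} v_{1501} + \dots + w_{3000} v_{3000}}{s}, \]
\[ \bar{z} = \frac{w_{3001} v_{3001} + \dots + w_{4700} v_{4700}}{t}. \]

Hence

\[ w_1 v_1 + \dots + w_{1500} v_{1500} = r \bar{x}, \]
\[ w_{1501} v_{1501} + \dots + w_{3000} v_{3000} = s \bar{y}, \]
\[ w_{3001} v_{3001} + \dots + w_{4700} v_{4700} = t \bar{z}. \]

Then

\[ \bar{v} = \frac{w_1 v_1 + \dots + w_{4700} v_{4700}}{w_1 + \dots + w_{4700}} \]
\[ = \frac{(w_1 v_1 + \dots + w_{1500} v_{1500}) + (w_{1501} v_{1501} + \dots + w_{3000} v_{3000}) + (w_{3001} v_{3001} + \dots + w_{4700} v_{4700})}{r + s + t} \]
\[ = \frac{r \bar{x} + s \bar{y} + t \bar{z}}{r + s + t} = \frac{r}{r+s+t}\, \bar{x} + \frac{s}{r+s+t}\, \bar{y} + \frac{t}{r+s+t}\, \bar{z}. \]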
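
The formula can be tested on synthetic data. In the sketch below the values and weights are randomly generated (they are not data from the text), the block boundaries are those of Exercise A.6, and the helper name `weighted_average` is hypothetical. The weighted average of the three block averages, with weights $r$, $s$, $t$, should match the weighted average computed directly from all 4700 values.

```python
import random

def weighted_average(values, weights):
    """Weighted average of the values; the weights need not be normalized."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

rng = random.Random(1)  # fixed seed so that the run is repeatable
values = [rng.uniform(-5.0, 5.0) for _ in range(4700)]
weights = [rng.uniform(0.1, 2.0) for _ in range(4700)]

# Index ranges of the blocks x, y, z (0-based, end exclusive).
blocks = [(0, 1500), (1500, 3000), (3000, 4700)]
block_weights = [sum(weights[lo:hi]) for lo, hi in blocks]  # r, s, t
block_means = [weighted_average(values[lo:hi], weights[lo:hi]) for lo, hi in blocks]

pooled = weighted_average(block_means, block_weights)
direct = weighted_average(values, weights)
print(pooled, direct)
```

Setting every weight equal to one reduces this to the situation of Exercise A.6, where the block weights become the block lengths 1500, 1500 and 1700.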


Solution (Exercise A.8).

\[ s = \sum_{i=1}^{n} w_i x_i, \qquad t = \sum_{i=1}^{n} w_i y_i. \]

Adding these equations,

\[ s + t = \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} w_i y_i = \sum_{i=1}^{n} w_i (x_i + y_i). \]
The triangle inequality


Proving the triangle inequality. The triangle inequality says that for
any numbers $x, y$,

\[ |x + y| \le |x| + |y|. \tag{B.1} \]

Proof. By definition $|x| = x$ if $x \ge 0$, and $|x| = -x$ if $x < 0$. Thus either
$|x + y| = x + y$ or $|x + y| = -x - y$.
The definition of $|x|$ tells us that

\[ x \le |x| \quad\text{and}\quad -x \le |x|. \tag{B.2} \]

Adding the inequalities $x \le |x|$ and $y \le |y|$ gives $x + y \le |x| + |y|$.

Adding the inequalities $-x \le |x|$ and $-y \le |y|$ gives $-x - y \le |x| + |y|$.

Since $|x + y| = x + y$ or $|x + y| = -x - y$, in every possible case we have
$|x + y| \le |x| + |y|$.


Equation (B.1) deals with a sum, but the triangle inequality also applies
to differences:

\[ |x - y| = |x + (-y)| \le |x| + |-y| = |x| + |y|. \tag{B.3} \]

Why is the triangle inequality called “the triangle inequality”? See
Figure B.1. This picture shows that the length of any side of a triangle is less
than or equal to the sum of the lengths of the other two sides. In vector
language: $\|a + b\| \le \|a\| + \|b\|$.

It’s interesting that the triangle inequality works for vectors, not just
numbers.
Figure B.1: The sum of two geometric vectors.


Exercise B.1. For readers who would like more practice with absolute val- 
ues. 


In equation (B.1), replace $x$ by $z - w$ and replace $y$ by $w$. Use the result
to show that

\[ |z| - |w| \le |z - w|. \]

Do this again, with the roles of $z$, $w$ reversed.

Then obtain:

\[ \bigl|\, |z| - |w| \,\bigr| \le |z - w|. \tag{B.4} \]

This inequality is sometimes useful. If you think about $z - w$ as a change in
some quantity, this inequality says: “The change in the absolute value is no
larger than the absolute value of the change.”

You would expect that to be true, but it’s nice to have a guarantee.


As usual, once we have the triangle inequality for the sum of two numbers,
we can get a similar inequality for any sum of numbers:

\[ |x_1 + \dots + x_k| \le |x_1| + \dots + |x_k|. \tag{B.5} \]

To establish equation (B.5), one can use the Old Induction Trick
(Exercise 2.23). Another approach is to reorder the terms, in order to separate
the positive and negative numbers in the sum. So write $x_1 + \dots + x_k$ as
$y_1 + \dots + y_m - (z_1 + \dots + z_n)$, where all the numbers $y_1,\dots,y_m,z_1,\dots,z_n$
are nonnegative. Then use the triangle inequality for the sum of two numbers:

\[ |y_1 + \dots + y_m - (z_1 + \dots + z_n)| \le |y_1 + \dots + y_m| + |z_1 + \dots + z_n|. \]

And since now we have nonnegative numbers, $|y_1 + \dots + y_m| = y_1 + \dots + y_m$ and
$|z_1 + \dots + z_n| = z_1 + \dots + z_n$.


B.1 Solutions for Appendix B

Solution (Exercise B.1). The requested substitution produces:

\[ |(z - w) + w| \le |z - w| + |w|. \]

Thus
\[ |z| \le |z - w| + |w|, \]
and so
\[ |z| - |w| \le |z - w|. \]

Exchanging $z$ and $w$ gives
\[ |w| - |z| \le |z - w|. \]

And one of the numbers $|z| - |w|$, $|w| - |z|$ is equal to $\bigl|\,|z| - |w|\,\bigr|$, so we have
obtained equation (B.4).


If you like, you can also check directly that equation (B.4) always holds,
as follows.

It is easy to see that equality holds in equation (B.4) if either of $z$ and $w$
is zero, or if both have the same sign.

Suppose that $z > 0$, $w < 0$. Then $z - w = z + |w| = |z| + |w|$, so $|z - w| = |z| + |w|$, while
$\bigl|\,|z| - |w|\,\bigr| = \pm(|z| - |w|)$. Then

\[ |z - w| - \bigl|\,|z| - |w|\,\bigr| = 2|w| \text{ or } 2|z|. \]

Thus equation (B.4) holds. A similar argument works if $z < 0$, $w > 0$.
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Appendix C 


Defining Z with a given 
distribution density on the real 
line 


Suppose that we are given a probability density function $h$ on $\mathbb{R}$. We would
like to construct a random variable $Z$ whose distribution is given by $h$. And
we would prefer to make $Z$ as simple as possible.

Here’s how to do that.

Let $\Omega = \mathbb{R}$ and let the probability distribution $P$ on $\Omega$ be given by the
density function $h$. Define the function $Z$ on $\mathbb{R}$ by

\[ Z(u) = u. \tag{C.1} \]

We claim that $Z$ is an example of a random variable with probability
density $h$.
To check that, note that from the definition of $Z$,

\[ \{Z \in S\} = \{u : Z(u) \in S\} = \{u : u \in S\} = S. \tag{C.2} \]

Thus $P(Z \in S) = P(S)$. Since $P$ is given by $h$,
\[ P(Z \in S) = \int_S h. \]


Thus $h$ is a density for the distribution of $Z$, as desired.

Appendix D

The expected value of $\varphi(X)$
using the density of the
distribution of $X$


Our goal here is to derive equation (15.5). The main idea is to move the
whole problem to the real line.

We are studying some random variable $X$, and $h$ is a density for the
distribution of $X$.

Let $Z$ be the random variable constructed in Appendix C.

Appendix C tells us that $X$ and $Z$ have exactly the same distribution.

Now suppose that $\varphi$ is any function on the real line.

We claim that the distribution of $\varphi(Z)$ is exactly the same as the distribution of $\varphi(X)$!

To show that, let $S$ be a subset of $\mathbb{R}$. We have to show that the probability
of the event $\{\varphi(X) \in S\}$, on the sample space of $X$, is exactly the same as
the probability of the event $\{\varphi(Z) \in S\}$, on the sample space of $Z$.

So let $T = \{u : \varphi(u) \in S\}$. Both $S$ and $T$ are subsets of the real line.

From the definitions,

\[ \{\varphi(X) \in S\} = \{X \in T\}. \]

Also from the definitions,
\[ \{\varphi(Z) \in S\} = \{Z \in T\}. \]

Since $X$ and $Z$ have the same distribution, the probability of the event
$\{X \in T\}$ is exactly equal to the probability of the event $\{Z \in T\}$.
But then the probability of the event $\{\varphi(X) \in S\}$ is exactly equal to the
probability of the event $\{\varphi(Z) \in S\}$, as claimed.

By equation (15.3),
\[ E[\varphi(Z)] = \int_{\mathbb{R}} \varphi(Z)\, h = \int_{-\infty}^{\infty} \varphi(t)\, h(t)\, dt. \]

By part (ii) of Theorem 15.2, the expected value of a random variable is
determined by its density, so $\varphi(X)$ and $\varphi(Z)$ have the same expected value,
and equation (15.5) holds.
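
As an informal numerical sanity check of the identity just derived, the following Python sketch compares a Monte Carlo estimate of $E[\varphi(X)]$ with a numerical value of $\int \varphi(t) h(t)\, dt$. The density $h(t) = 2t$ on $[0, 1]$, the function $\varphi(t) = t^2$, and the sampling recipe $X = \sqrt{U}$ (with $U$ uniform on $[0, 1]$) are illustrative assumptions, not taken from the text.

```python
import math
import random


def h(t):
    """Assumed example density: h(t) = 2t on [0, 1], zero elsewhere."""
    return 2.0 * t if 0.0 <= t <= 1.0 else 0.0


def phi(t):
    """Assumed example function on the real line."""
    return t * t


def integral_phi_h(n_steps=100_000):
    """Midpoint-rule approximation of the integral of phi(t) * h(t) over [0, 1]."""
    dt = 1.0 / n_steps
    return sum(phi((i + 0.5) * dt) * h((i + 0.5) * dt) for i in range(n_steps)) * dt


def monte_carlo_mean(n_samples=200_000, seed=0):
    """Average of phi(X), where X = sqrt(U) has density 2t on [0, 1]."""
    rng = random.Random(seed)
    return sum(phi(math.sqrt(rng.random())) for _ in range(n_samples)) / n_samples


integral_value = integral_phi_h()   # analytically, the integral of 2 t^3 over [0, 1] is 1/2
estimate = monte_carlo_mean()
print(integral_value, estimate)
```

The two printed numbers should agree up to Monte Carlo noise, which is what equation (15.5) predicts for this particular choice of $h$ and $\varphi$.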
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Practice using densities 


There are no new ideas in this appendix, just some practice to get a feeling 
for the way densities work. 


Example E.1 (Probability of missing the central region - density
case). Someone is throwing darts at a target represented by a disc of radius 5,
centered at the origin of $\mathbb{R}^2$. See Figure 3.4.

The point of impact $(x, y)$ is random, but the thrower is trying hard to
hit a point near the center of the target. Thus the probability density for the
distribution of impact points is not uniform: it is described by a probability
density $f$ on the target which is large near the center.

In fact, we will use a model with $f$ defined by:

\[ f(x, y) = \frac{c}{\sqrt{x^2 + y^2}} = \frac{c}{r}, \]

where $r$ is the distance of the point $(x, y)$ from the center of the target.

Let $A$ be the set of points $(x, y)$ in the target such that $\sqrt{x^2 + y^2} > 2$.
$A$ represents the physical event that the dart lands more than two units of
distance from the center. In the figure, $A$ is the shaded ring.

We will find $P(A)$ by integrating the density $f$ over $A$. Fortunately, in this example
we can calculate integrals using polar coordinates.

The first step is to find $c$.

\[ 1 = \iint f = \int_0^{2\pi}\!\!\int_0^5 \frac{c}{r}\, r\, dr\, d\theta = 10\pi c. \]

Hence $c = \frac{1}{10\pi}$. Thus

\[ P(A) = \iint_A f = \int_0^{2\pi}\!\!\int_2^5 \frac{1}{10\pi r}\, r\, dr\, d\theta = \frac{1}{10\pi}(2\pi \cdot 3) = \frac{3}{5}. \]
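
The polar-coordinate calculation in Example E.1 can be checked numerically. The sketch below is only an illustration: the helper name `polar_integral` and the grid sizes are assumptions, and the midpoint rule is one of many ways to approximate the integral.

```python
import math


def polar_integral(density, r_lo, r_hi, n_r=2000, n_theta=200):
    """Midpoint rule for the integral of density(r, theta) * r over
    r_lo < r < r_hi and 0 < theta < 2*pi."""
    dr = (r_hi - r_lo) / n_r
    dtheta = 2.0 * math.pi / n_theta
    total = 0.0
    for i in range(n_r):
        r = r_lo + (i + 0.5) * dr
        for j in range(n_theta):
            theta = (j + 0.5) * dtheta
            total += density(r, theta) * r * dr * dtheta
    return total


c = 1.0 / (10.0 * math.pi)   # the constant found in Example E.1


def f(r, theta):
    """Density of Example E.1 written in polar form: c / r."""
    return c / r


total_mass = polar_integral(f, 0.0, 5.0)   # should be close to 1
p_A = polar_integral(f, 2.0, 5.0)          # should be close to 3/5
print(total_mass, p_A)
```

Because the factor $r$ from polar coordinates cancels the $1/r$ in the density, the integrand is constant and the numerical values should match $1$ and $3/5$ up to rounding.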


Exercise E.1. In the dart-throwing experiment, let $h(x, y)$ be the probability
density for the random location where the dart lands. If $A$ is a region of the
target board, then the probability of hitting $A$ is given by

\[ P(A) = \int_A h. \tag{E.1} \]

In calculus notation,

\[ P(A) = \iint_A h(x, y)\, dx\, dy. \]

Consider the situation when the target region $T$ is a circular disc with
radius one centered at the origin, and assume that the thrower has a tendency to throw toward the right. More precisely, assume that $h(x, y)$ on $T$
is proportional to $2 + x$.

(i) Find the exact formula for $h(x, y)$.

(ii) Find the probability that the dart lands in the right half of the target.
That is, if $A = \{(x, y) : (x, y) \in T,\ x > 0\}$, find $P(A)$.


Exercise E.2. In this problem we have a rectangular target.

Let $\Omega$ be the rectangle consisting of all points $(x, y)$ such that $0 \le x \le 1$
and $0 \le y \le 3$.

Let $h$ be a probability density on $\Omega$ given by

\[ h(x, y) = c\,x \sin(xy), \tag{E.2} \]

where $c$ is an appropriate constant.
Let $P$ be the distribution on $\Omega$ with probability density $h$.

Consider choosing a random point in $\Omega$ using this distribution. Let $A$ be
the event that the chosen point $(x, y)$ is such that $y < 1$.

Find $P(A)$.


Exercise E.3. Consider the probability model with sample space $[0, \pi/4]$
and probability density $f(t) = \sqrt{2} \cos t$.

Let $X$ be the random variable on $[0, \pi/4]$ defined by $X(t) = \sin t$. Find
$E[X]$.


Remark E.2 (Are unbounded densities ok?). By definition, any nonnegative function $f$ with $\int f = 1$ is a probability density. So a probability
density $f$ is not necessarily a bounded function. An earlier theorem does tell us
that $E[X]$ is always defined if $X$ is a bounded random variable. But what
about $\int X f$? Does that integral always exist when $X$ is bounded? Sufficiently paranoid people (like us) might worry: if $f$ is unbounded, could that
spoil $\int X f$, and contradict equation (15.3)?

Fortunately not, because of the Comparison Principle for integrals. Since
$\int f$ is defined, $\int c f$ is also defined for any constant $c$. And if $X$ is a bounded
random variable, by definition this means that for some value of $c$ we have
$|X| \le c$. But that implies that $|X f| \le c f$. And so the comparison principle
for integrals guarantees that $\int X f$ exists.


Exercise E.4. Let $\Omega = (0, 4]$.
Let $f$ be the probability density on the interval $[0, 4]$, such that $f(t) = c/\sqrt{t}$. (An unbounded density.)
Let $P$ be the probability set-function on $[0, 4]$ with probability density $f$.
Let $X$ be the random variable on $(0, 4]$ defined by $X(t) = t^{-1/4}$.
Find $c$, and then find $E[X]$.
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Exercise E.5 (Finding the expected value of the first coordinate). 
Consider the setting of Exercise E.1.

The sample space $\Omega$ is the unit disc $T$. Let $X$ be the mathematical
random variable defined on $T$ by the equation

\[ X((x, y)) = x. \]

The physical interpretation of $X$ is that it represents the $x$-coordinate of
the spot where a dart strikes the target.

By assumption, a density for the probability function $P$ on $T$ is given by
$h$, where

\[ h(x, y) = c\,(2 + x) \]

for an appropriate value of $c$. From the solution to Exercise E.1 we know
that $c = 1/(2\pi)$.
Find $E[X]$.


Exercise E.6. Let $\Omega$ be the disc with radius 5 centered at the origin. Let
$H$ be the function on $\Omega$ defined by $H((x, y)) = e^{r} = e^{\sqrt{x^2 + y^2}}$.

(i) Consider the probability model with sample space $\Omega$ and uniform probability distribution $P$ on $\Omega$. Find $E[H]$.

(ii) Consider the probability model with sample space $\Omega$ and probability
distribution $P$ as in Example E.1, so that $P$ has a density given by
$\frac{1}{10\pi r}$. Find $E[H]$.
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E.1 Solutions for Chapter 


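Solution (Exercise E.1). (i) We are told that $h(x, y) = c\,(2 + x)$ for
some proportionality constant $c$.
We know that $h$ must have integral equal to one, since it is a probability
density. Hence

\[ \iint_T c\,(2 + x)\, dx\, dy = 1. \]

We have

\[ \iint_T dx\, dy = \pi, \]

since this integral is the area of the unit circle.

Also

\[ \iint_T x\, dx\, dy = 0 \]

by symmetry! Applying these facts gives $2\pi c + 0 = 1$, so $c = 1/(2\pi)$, and
$h(x, y) = \frac{1}{2\pi}(2 + x)$.

(ii) Using the formula for $h$, and equation (E.1),

\[ P(A) = \frac{1}{2\pi} \iint_A (2 + x)\, dx\, dy. \]

Since $A$ is the right half of the unit circle,

\[ \iint_A dx\, dy = \frac{\pi}{2}. \]

Also

\[ \iint_A x\, dx\, dy = \int_0^1 \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} x\, dy\, dx = \int_0^1 2x\sqrt{1 - x^2}\, dx = -\frac{2}{3}(1 - x^2)^{3/2}\, \Big|_0^1 = \frac{2}{3}. \]

Hence

\[ P(A) = \frac{1}{2\pi}\left(2 \cdot \frac{\pi}{2} + \frac{2}{3}\right) = \frac{1}{2} + \frac{1}{3\pi}. \]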
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Solution (Exercise E.2) . Since [oh =1, 


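\[ 1 = \int_0^1 \int_0^3 c\,x \sin(xy)\, dy\, dx = \int_0^1 c\,\bigl(-\cos(xy)\bigr)\Big|_{y=0}^{y=3}\, dx = c \int_0^1 (1 - \cos(3x))\, dx = c\left(x - \frac{\sin(3x)}{3}\right)\Big|_0^1 = c\left(1 - \frac{\sin 3}{3}\right). \]

Thus

\[ P(A) = \int_0^1 \int_0^1 c\,x \sin(xy)\, dy\, dx = \int_0^1 c\,\bigl(-\cos(xy)\bigr)\Big|_{y=0}^{y=1}\, dx = c \int_0^1 (1 - \cos x)\, dx = c\,(x - \sin x)\Big|_0^1 = c\,(1 - \sin 1) = \frac{1 - \sin 1}{1 - \frac{\sin 3}{3}}. \]
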
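Solution (Exercise E.3). By equation (15.3),
\[ E[X] = \int_0^{\pi/4} (\sin t)(\sqrt{2} \cos t)\, dt = \frac{1}{\sqrt{2}} \sin^2 t\, \Big|_0^{\pi/4} = \frac{1}{\sqrt{2}} \sin^2(\pi/4) - 0 = \frac{1}{\sqrt{2}} \cdot \frac{1}{2} = \frac{1}{2\sqrt{2}}. \]

Solution (Exercise E.4).
\[ 1 = \int_0^4 \frac{c}{\sqrt{z}}\, dz = 2c\sqrt{z}\, \Big|_0^4 = 2c\,(2 - 0) = 4c, \]
so $c = 1/4$.
By equation (15.3),
\[ E[X] = \int_0^4 X f = \frac{1}{4} \int_0^4 t^{-3/4}\, dt = \frac{1}{4} \cdot 4\, t^{1/4}\, \Big|_0^4 = 4^{1/4} - 0 = \sqrt{2}. \]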


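Solution (Exercise E.5). By equation (15.3),

\[ E[X] = \iint_T X h = \frac{1}{2\pi} \int_{-1}^{1} \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} x\,(2 + x)\, dy\, dx = \frac{1}{2\pi} \int_{-1}^{1} x\,(2 + x) \left( \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} dy \right) dx. \]

Hence

\[ E[X] = \frac{1}{2\pi} \int_{-1}^{1} 2x\,(2 + x)\sqrt{1 - x^2}\, dx. \]

Nowadays one can use a computer algebra system to evaluate the integral.
However, a reasonable manual evaluation is the following.
By symmetry,

\[ \int_{-1}^{1} x\sqrt{1 - x^2}\, dx = 0 \quad \text{and} \quad \int_{-1}^{1} x^2\sqrt{1 - x^2}\, dx = 2 \int_0^1 x^2\sqrt{1 - x^2}\, dx. \]

Thus
\[ E[X] = \frac{2}{\pi} \int_0^1 x^2\sqrt{1 - x^2}\, dx. \]

Let $x = \sin\theta$. Then $dx = \cos\theta\, d\theta$. When $\theta = 0$, $x = 0$. When $\theta = \pi/2$,
$x = 1$. Hence

\[ E[X] = \frac{2}{\pi} \int_0^{\pi/2} \sin^2\theta \cos^2\theta\, d\theta = \frac{1}{2\pi} \int_0^{\pi/2} \sin^2 2\theta\, d\theta. \]

Letting $\theta = \frac{1}{2}y$, $d\theta = \frac{1}{2}\,dy$, gives

\[ E[X] = \frac{1}{2\pi} \int_0^{\pi} (\sin^2 y)\, \frac{1}{2}\, dy = \frac{1}{4\pi} \int_0^{\pi} \sin^2 y\, dy. \]

The fastest way to evaluate this definite integral is to note that $\sin^2 + \cos^2 = 1$, and $\sin^2$ and $\cos^2$ have equal integrals over the interval $[0, \pi]$. Hence
$\int_0^{\pi} \sin^2 = (1/2)\pi$, and so
\[ E[X] = \frac{1}{4\pi} \cdot \frac{\pi}{2} = \frac{1}{8}. \]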
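
For readers who prefer a numerical cross-check to a computer algebra system, here is a minimal Python sketch that approximates $E[X]$ for Exercise E.5 by a midpoint rule in polar coordinates. The function name `expected_x` and the grid sizes are assumptions made for illustration.

```python
import math


def expected_x(n_r=1000, n_theta=400):
    """Approximate the total mass of h and E[X] on the unit disc,
    where h(x, y) = (2 + x) / (2*pi), using a polar midpoint rule."""
    dr = 1.0 / n_r
    dtheta = 2.0 * math.pi / n_theta
    mass = 0.0
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * dr
        for j in range(n_theta):
            theta = (j + 0.5) * dtheta
            x = r * math.cos(theta)
            h = (2.0 + x) / (2.0 * math.pi)
            cell = r * dr * dtheta      # area element in polar coordinates
            mass += h * cell
            total += x * h * cell
    return mass, total


mass, e_x = expected_x()
print(mass, e_x)   # mass should be near 1 and e_x near 1/8
```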
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Solution (Exercise |E.6). (i) By equation (15.3), 


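\[ E[H] = \int_0^{2\pi}\!\!\int_0^5 e^{r}\, \frac{1}{25\pi}\, r\, dr\, d\theta = \frac{2}{25} \int_0^5 e^{r} r\, dr = \frac{2}{25} \bigl(r e^{r} - e^{r}\bigr)\Big|_0^5 = \frac{2}{25}\bigl(4e^5 + 1\bigr). \]

We use integration by parts to calculate the integral.

(ii) By equation (15.3),

\[ E[H] = \int_0^{2\pi}\!\!\int_0^5 e^{r}\, \frac{1}{10\pi r}\, r\, dr\, d\theta = \frac{1}{5} \int_0^5 e^{r}\, dr = \frac{1}{5}\bigl(e^5 - 1\bigr). \]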
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Appendix F 


Nonnegative random variables 
with zero expectation 


Let X be a nonnegative random variable such that E[X] = 0. We’ll give an 
argument to show that P(X > 0) = 0. 

If X has finite range, a direct proof is not hard, based on definition 
of expected values for finite-range random variables. But there is an easy 
proof for general random variables. We start by using the Markov inequality 


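(equation (12.17) of Lemma 12.12):

\[ \frac{1}{n}\, P\!\left(X \ge \frac{1}{n}\right) \le E[X] \]

for every positive integer $n$. Since $E[X] = 0$, this inequality tells us that

\[ P\!\left(X \ge \frac{1}{n}\right) = 0 \tag{F.1} \]

for every positive integer $n$.

Since $1/n$ can be made as small as we like by taking large values of $n$,
surely equation (F.1) is enough to guarantee that $P(X > 0) = 0$. Isn’t it?

Well, we’re being fussy here, but we still need a little more discussion to
finish, as follows.

Notice that $\{X > 0\}$ is the union of the following disjoint sets:

\[ \{X \ge 1\},\ \left\{1 > X \ge \tfrac{1}{2}\right\},\ \left\{\tfrac{1}{2} > X \ge \tfrac{1}{3}\right\},\ \left\{\tfrac{1}{3} > X \ge \tfrac{1}{4}\right\},\ \dots \]

Remember that we always assume that we have countable additivity for our
models (Section 14.3). Thus

\[ P(X > 0) = P(X \ge 1) + P\!\left(1 > X \ge \tfrac{1}{2}\right) + P\!\left(\tfrac{1}{2} > X \ge \tfrac{1}{3}\right) + P\!\left(\tfrac{1}{3} > X \ge \tfrac{1}{4}\right) + \dots, \]

and that sum is indeed zero, since each set in the union is contained in $\{X \ge 1/n\}$ for some $n$, and so has probability zero by equation (F.1).
And since $P(X > 0) = 0$, and $X$ is nonnegative, we must have

\[ P(X = 0) = 1 - P(X > 0) = 1. \]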
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Appendix G 


Inequalities for log and 
exponential 


As usual in mathematics, log x will denote the logarithm of x using base e. 


Lemma G.1 (Basic inequalities for log and exponential).
\[ \log(1 + x) \le x \quad \text{for every } x \in (-1, \infty), \tag{G.1} \]
and
\[ 1 + x \le e^{x} \quad \text{for every } x \in (-\infty, \infty). \tag{G.2} \]

Proof. Let
\[ f(x) = x - \log(1 + x). \]

We need to show that $f(x) \ge 0$ for all $x \in (-1, \infty)$.
Note that $f(0) = 0$. Also,

\[ f'(x) = 1 - \frac{1}{1 + x}. \]

Clearly,
\[ 1 + x > 1 \text{ for } x > 0; \qquad 1 + x < 1 \text{ for } -1 < x < 0. \]
Hence
\[ f'(x) > 0 \text{ for } x > 0; \qquad f'(x) < 0 \text{ for } -1 < x < 0. \]
Since $f$ is decreasing on $(-1, 0)$ and increasing on $(0, \infty)$, the minimum of $f$
on $(-1, \infty)$ occurs at $x = 0$, and so the minimum value of $f$ on $(-1, \infty)$ is
$f(0) = 0$.

Thus for $x \in (-1, \infty)$,

\[ x - \log(1 + x) \ge 0, \quad \text{i.e.} \quad x \ge \log(1 + x). \]

This proves equation (G.1).

The exponential function is an increasing function. Hence if we take
the exponential of both sides of an inequality we get another true inequality.
Taking the exponential of both sides of equation (G.1) gives the inequality
of equation (G.2), for $x > -1$. Since $e^{x} > 0$ is always true, the inequality of
equation (G.2) also holds for $x \le -1$.


See Figure G.1 for the picture of the inequality in equation (G.1), and
Figure G.2 for the inequality in equation (G.2).

Figure G.1: $x$ is above $\log(1 + x)$. The curves are tangent at $x = 0$.
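
The two inequalities of Lemma G.1 can also be spot-checked numerically. The sketch below evaluates the gaps $x - \log(1 + x)$ and $e^x - (1 + x)$ on grids of points; the grids and the small rounding tolerance are assumptions chosen for illustration, and a grid check is of course not a proof.

```python
import math

# Sample points in (-1, 9) for (G.1) and in [-10, 10] for (G.2).
xs_g1 = [-0.999 + 0.001 * k for k in range(10_000)]
xs_g2 = [-10.0 + 0.01 * k for k in range(2_001)]

# Smallest observed value of x - log(1 + x); (G.1) says this is never negative.
gap_g1 = min(x - math.log1p(x) for x in xs_g1)

# Smallest observed value of e^x - (1 + x); (G.2) says this is never negative.
gap_g2 = min(math.exp(x) - (1.0 + x) for x in xs_g2)

print(gap_g1, gap_g2)   # both should be >= 0, up to floating-point rounding
```

Both minima occur near $x = 0$, which matches the tangency visible in the figures.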


The function log(1 + x) grows slowly, so it seems natural that we have a 
simple upper bound for log(1 + 2). Similarly it seems natural that we have 
a simple lower bound for e*. But one sometimes needs inequalities in the 
other direction. There is probably no “neatest” way to state these. Here’s 
one example of a bound in the other direction. 
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Figure G.2: 1+ 2 is below e”. The curves are tangent at x = 0. 


Lemma G.2 (Reversed bounds). For any real number $x$ with $-\frac{1}{2} < x$,
\[ x - x^2 \le \log(1 + x), \tag{G.3} \]
and
\[ e^{x - x^2} \le 1 + x. \tag{G.4} \]

See Figure G.3 for a picture of the inequality in equation (G.3).

Note that equations (G.3) and (G.4) are uninteresting when $x$ is large.

Proof. Since the exponential is an increasing function, the inequality in equation (G.4) is equivalent to equation (G.3), so we only need to prove that
inequality.

Define $f$ on $(-\frac{1}{2}, \infty)$ by

\[ f(x) = \log(1 + x) - x + x^2. \]

We’ll be finished as soon as we prove that $f(x) \ge 0$ for any real number $x$
with $-\frac{1}{2} < x$.
So let’s find the minimum value of $f(x)$ on this interval, and see if it’s
greater than or equal to zero.
We have

\[ f'(x) = \frac{1}{1 + x} - 1 + 2x = \frac{1 - (1 + x) + 2x(1 + x)}{1 + x} = \frac{x + 2x^2}{1 + x} = \frac{x(1 + 2x)}{1 + x}. \]

Clearly $f'(x) > 0$ for $x \in (0, \infty)$. Thus for $x > 0$ we have $f(x) > f(0) = 0$.

Also, $f'(x)$ has the same sign as $x(1 + 2x)$ for $x < 0$. Since $1 + 2x > 0$
for $x > -\frac{1}{2}$, we have $x(1 + 2x) < 0$ for $-\frac{1}{2} < x < 0$. Thus $f'(x) < 0$ for
$-\frac{1}{2} < x < 0$.

Hence the minimum value of $f$ on $(-\frac{1}{2}, \infty)$ is $f(0)$, and $f(0) = 0$.
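
A quick numerical look at the function $f$ from this proof shows why the restriction $-\frac{1}{2} < x$ matters. The following Python sketch (grid and tolerance are illustrative assumptions) checks that $f \ge 0$ on sample points to the right of $-\frac{1}{2}$, and evaluates $f$ at one point to the left, where the bound (G.3) fails.

```python
import math


def f(x):
    """f(x) = log(1 + x) - x + x^2, the function used in the proof of Lemma G.2."""
    return math.log1p(x) - x + x * x


# Sample points in (-1/2, 5).
xs = [-0.499 + 0.001 * k for k in range(5_500)]
min_f = min(f(x) for x in xs)

# A point to the left of -1/2, where (G.3) is no longer guaranteed.
f_outside = f(-0.9)

print(min_f, f_outside)
```

On this grid the minimum is essentially zero (attained near $x = 0$), while $f(-0.9)$ is negative, so the hypothesis on $x$ cannot simply be dropped.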


Figure G.3: A lower bound for $\log(1 + x)$.


G.1 Proving equation (17.1) again 


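Proof. Since $a_n \to \infty$,

\[ \lim_{n \to \infty} b_n = \lim_{n \to \infty} (a_n b_n)\left(\frac{1}{a_n}\right) = z \cdot 0 = 0. \]

Hence $b_n^2 a_n = b_n \cdot b_n a_n \to 0 \cdot z = 0$.
For $n$ large enough that $b_n > -\frac{1}{2}$ and $a_n > 0$, we can use equation (G.4) and equation (G.2), giving:

\[ e^{b_n - b_n^2} \le 1 + b_n \le e^{b_n}, \]

i.e.

\[ e^{b_n a_n - b_n^2 a_n} \le (1 + b_n)^{a_n} \le e^{b_n a_n}. \]

Since $b_n a_n \to z$ and $b_n^2 a_n \to 0$,

\[ \lim_{n \to \infty} e^{b_n a_n - b_n^2 a_n} = e^{z} \quad \text{and also} \quad \lim_{n \to \infty} e^{b_n a_n} = e^{z}. \]

Thus $(1 + b_n)^{a_n}$ is trapped between two quantities that both converge to $e^{z}$,
and so we also have

\[ \lim_{n \to \infty} (1 + b_n)^{a_n} = e^{z}. \]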
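
To see the squeeze argument at work, the sketch below computes the lower bound, the middle quantity $(1 + b_n)^{a_n}$, and the upper bound for one concrete pair of sequences. The choices $z = 1.5$, $a_n = n$ and $b_n = z/n + 1/n^2$ are assumptions made only for this illustration; they satisfy $a_n \to \infty$ and $a_n b_n \to z$.

```python
import math


def sandwich(a_n, b_n):
    """Return (lower, middle, upper) from the proof:
    exp(a*b - a*b^2) <= (1 + b)^a <= exp(a*b), valid when b > -1/2 and a > 0."""
    lower = math.exp(a_n * b_n - a_n * b_n ** 2)
    middle = (1.0 + b_n) ** a_n
    upper = math.exp(a_n * b_n)
    return lower, middle, upper


z = 1.5
rows = []
for n in (10, 100, 1_000, 10_000, 100_000):
    a_n = float(n)
    b_n = z / n + 1.0 / n ** 2      # assumed example: a_n * b_n = z + 1/n -> z
    rows.append((n,) + sandwich(a_n, b_n))

for n, lower, middle, upper in rows:
    print(n, lower, middle, upper)
print("e^z =", math.exp(z))
```

As $n$ grows, all three columns should approach $e^{z}$, with the middle column always between the other two.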
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Appendix H 


Completing the square 


This appendix is for those who have not previously used the procedure of 
completing the square. 

Let $q(x) = ax^2 + bx + c$, with $a$ nonzero. We will prove that $q$ can be
written as $a(x + r)^2 + \text{stuff}$ for some number $r$, where stuff is a constant
expression.

We begin by writing

\[ q = a\left(x^2 + (b/a)x + (c/a)\right). \]

If we have $x^2 + (b/a)x + (c/a)$ in the form we want, then multiplying by $a$
should be no problem. So from now on let’s just work on $x^2 + (b/a)x + (c/a)$.
We would like $r$ to be such that

\[ x^2 + \frac{b}{a}x + \frac{c}{a} = (x + r)^2 + \text{stuff}, \]

where stuff is a constant expression.

Then
\begin{align*}
\text{stuff} &= x^2 + \frac{b}{a}x + \frac{c}{a} - (x + r)^2 \\
&= x^2 + \frac{b}{a}x + \frac{c}{a} - (x^2 + 2rx + r^2) \\
&= \frac{b}{a}x - 2rx + \frac{c}{a} - r^2. \tag{H.1}
\end{align*}

To ensure that stuff is constant, we need $b/a = 2r$, i.e. $r = b/(2a)$.
Since the variable terms cancel out in equation (H.1), we have

\[ \text{stuff} = \frac{c}{a} - r^2 = \frac{c}{a} - \left(\frac{b}{2a}\right)^2. \]

That is,

\[ x^2 + \frac{b}{a}x + \frac{c}{a} = \left(x + \frac{b}{2a}\right)^2 + \frac{c}{a} - \left(\frac{b}{2a}\right)^2. \tag{H.2} \]

What have we achieved by rewriting the expression $x^2 + \frac{b}{a}x + \frac{c}{a}$ in this
way? Well, now it is in the form $\text{blob}^2 + \text{constant stuff}$, where blob is just
$x + b/(2a)$. We can manipulate blob in just the same way that we manipulated
$x$, so we have reduced the complexity of the expression.

Hence

\[ q(x) = a\left(x + \frac{b}{2a}\right)^2 + c - \frac{b^2}{4a}. \]
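
The procedure above is mechanical enough to write as a short program. The following Python sketch is one possible rendering; the function names `complete_square` and `real_roots` are hypothetical, and the quadratic $2x^2 - 4x - 6$ used at the end is an example chosen here, not one from the text.

```python
import math


def complete_square(a, b, c):
    """Return (r, s) such that a*x**2 + b*x + c == a*(x + r)**2 + s.
    Requires a != 0; here r = b/(2a) and s = c - b^2/(4a)."""
    if a == 0:
        raise ValueError("a must be nonzero")
    r = b / (2 * a)
    s = c - b * b / (4 * a)
    return r, s


def real_roots(a, b, c):
    """Solve a*x**2 + b*x + c = 0 over the reals by completing the square."""
    r, s = complete_square(a, b, c)
    t = -s / a                  # a*(x + r)^2 + s = 0  <=>  (x + r)^2 = -s/a
    if t < 0:
        return []               # no real solutions
    root = math.sqrt(t)
    return sorted({-r - root, -r + root})


# Example (an assumption for illustration): 2x^2 - 4x - 6 = 2(x - 1)^2 - 8.
r, s = complete_square(2, -4, -6)
roots = real_roots(2, -4, -6)
print(r, s, roots)
```

For this example the rewritten form is $2(x - 1)^2 - 8$, so the equation reduces to $(x - 1)^2 = 4$.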


Exercise H.1 (Completing the square to solve quadratics). The usual 
“quadratic formula” for solving a quadratic equation is derived by completing 
the square. 

Illustrate this approach by solving the equation $x^2 + x - 1 = 0$, by completing the square for the given quadratic polynomial. Do not use the quadratic
formula.


H.1 Solutions for Appendix H

Solution (Exercise H.1). We want to solve $x^2 + x - 1 = 0$.
“Completing the square” for this polynomial means writing

\[ x^2 + x - 1 = (x + r)^2 + \text{stuff}, \]
where stuff is a constant expression.
We don’t have to remember any formulas here. Just notice that
\[ \text{stuff} = x^2 + x - 1 - (x + r)^2 = x^2 + x - 1 - (x^2 + 2rx + r^2) = x - 2rx - 1 - r^2. \tag{H.3} \]

To ensure that stuff is a constant expression, we need to have $2rx = x$, i.e.
$r = 1/2$. Then equation (H.3) says that

\[ \text{stuff} = -1 - r^2 = -1 - \frac{1}{4} = -\frac{5}{4}. \]

Thus completing the square gives us:

\[ x^2 + x - 1 = \left(x + \frac{1}{2}\right)^2 - \frac{5}{4}. \]

Solving $x^2 + x - 1 = 0$ means solving

\[ \left(x + \frac{1}{2}\right)^2 - \frac{5}{4} = 0, \]

i.e.
\[ \left(x + \frac{1}{2}\right)^2 = \frac{5}{4}, \]
i.e.
\[ x + \frac{1}{2} = \pm\frac{\sqrt{5}}{2}, \]
so
\[ x = -\frac{1}{2} \pm \frac{\sqrt{5}}{2}. \]
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Appendix I 


Distributions and cumulative 
distribution functions 


I.1 Cumulative distribution functions 


The definition of the cumulative distribution function (CDF) for a random
variable is given in Definition 18.20. In this Appendix we continue the discussion of CDFs that was begun in Section 18.7.


Exercise I.1 (The simplest CDF). Let $X$ be the constant random variable
defined by $X(\omega) = c$ for all $\omega$. Sketch the graph of $F_X$.


Exercise I.2 (Monotonicity). Using Definition 18.20, please check that
every distribution function is monotone increasing.


Exercise I.3. Let $X$ be a random variable such that $c \le X \le d$ always
holds.
Show that $F_X(t) = 0$ for all $t < c$, and $F_X(t) = 1$ for all $t \ge d$.
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Include the whole domain for a CDF When you are requested to find 
the formula for a CDF, please state the formula for all points on the real line. 
This may be unnecessary in many obvious cases, as exercise illustrates, 
but it is a good practice. 


Remark I.1 (Useful limits for CDFs). There are many probabilities
which can be expressed in terms of $F_X$. We’ll just mention two limits:

\[ \lim_{a \to -\infty} F_X(a) = 0, \qquad \lim_{b \to \infty} F_X(b) = 1. \tag{I.1} \]

By Exercise I.3, equation (I.1) is obvious when $X$ is bounded.


Example I.2. In the case of a simple random variable with finite range, the
cumulative distribution function is likely more complicated than it’s worth.
But the CDF is still defined. To practice with the definition, we’ll consider
two examples.


(i) Consider the random variable $X$ described in Example 9.2. Take $p = 3/5$,
so that $P(X = 1) = 3/5$ and $P(X = 0) = 2/5$.

Figure I.1 shows the graph of $F_X$.

The definition of $F_X$ implies that $F_X(t) = 2/5$ for all $t$ with $0 \le t < 1$,
while $F_X(t) = 0$ for $t < 0$ and $F_X(t) = 1$ for $t \ge 1$. The graph shows this.


(ii) Consider tossing a fair coin 4 times. Let $X$ be the number of heads
which are obtained.

Figure I.2 shows the graph of $F_X$.

Since the possible values of $X$ are $0, 1, 2, 3, 4$, when $k \le t < k + 1$, we have
$P(X \le t) = P(X \le k)$. Thus the definition of $F_X$ implies that when
$k \le t < k + 1$,

\[ F_X(t) = P(X \le k) = P(X = 0) + \dots + P(X = k). \]

The graph shows this.


In the present appendix we will mainly use CDFs for random variables
whose distributions have probability densities.


Figure I.1: CDF for result of one coin toss, $p = 3/5$.


Example I.3. 

Let the sample space for the probability model be the interval $[0, 4]$, with
uniform distribution $P$. See Figure I.3. Let $X(t) = t^3$.

For $0 \le t \le 64$, $F_X(t) = (1/4) \cdot \text{length}([0, t^{1/3}]) = t^{1/3}/4$. By Exercise I.3, for every $t < 0$ we have $F_X(t) = 0$ and for every $t > 64$ we have
$F_X(t) = 1$. See Figure I.4.
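
The CDF formula in Example I.3 can be compared with an empirical CDF obtained by simulation. The sketch below is illustrative only: the helper names, the seed, the sample size and the checkpoints are all assumptions.

```python
import random


def cdf_formula(t):
    """CDF of X(t) = t^3 on the sample space [0, 4] with uniform distribution."""
    if t < 0:
        return 0.0
    if t > 64:
        return 1.0
    return t ** (1.0 / 3.0) / 4.0


def empirical_cdf(t, samples):
    """Fraction of simulated values that are <= t."""
    return sum(1 for x in samples if x <= t) / len(samples)


rng = random.Random(1)
# Draw points uniformly from [0, 4] and apply X to each of them.
samples = [rng.uniform(0.0, 4.0) ** 3 for _ in range(100_000)]

checkpoints = [1.0, 8.0, 27.0, 64.0]
max_gap = max(abs(empirical_cdf(t, samples) - cdf_formula(t)) for t in checkpoints)
print(max_gap)   # should be small, of the order of the Monte Carlo error
```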


Exercise I.4. Consider the probability model with sample space $[0, \pi/2]$ and
probability density $f(u) = \sin u$. Let $X$ be the random variable on $[0, \pi/2]$
defined by $X(u) = e^{u}$. Find the CDF of $X$.


The next exercise deals with a situation where it takes more work to find 
the CDF of the random variable. But it’s doable. 


Exercise I.5 (A non-monotonic random variable). Consider the probability model with sample space $\Omega$ equal to the interval $[0, 3]$ and uniform
distribution. Define $X$ on $[0, 3]$ by $X(t) = (1 - t)^2$. See Figure I.5. Find the
CDF for $X$.

Since this random variable is not a monotonic function on its domain, a
little extra care is needed in determining $\{X \le t\}$.


Figure I.2: CDF for result of four fair coin tosses
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Figure I.3: X(t) = ¢t? on the sample space [0, 4]. 
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Figure 1.4: F(t) = 1/4¢/8 
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0.5 1.0 15 2.0 2.5 3.0 


Figure I.5: $X(t) = (1 - t)^2$ on the sample space $[0, 3]$.
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I.2 Finding a density from the CDF of a dis- 
tribution 


Lemma I.4 (Differentiating a CDF). Let $X$ be a random variable whose
distribution has density $f$.
For every real number $a$,

\[ F_X(a) = \int_{-\infty}^{a} f(t)\, dt. \tag{I.2} \]
At any point $a$ where $f$ is continuous,

\[ F_X'(a) = f(a). \tag{I.3} \]


Proof. Let $A = \{X \le a\}$. By definition, $P(A) = \int_{-\infty}^{a} f$. This is equation (I.2).

The first statement of the Fundamental Theorem of Calculus tells us
that when the integrand of an integral is continuous at the upper limit of
integration, we can differentiate the integral with respect to its upper limit,
and the result of the differentiation is the value of the integrand at the upper
limit. Here the integrand is $f$ and the upper limit of integration is $a$. This
gives equation (I.3).


Lemma I.4 shows that if there exists a continuous density $f$ for the distribution of $X$, then $f = F_X'$. This is useful, but often we encounter random
variables whose ranges have a few “bad” points, where $F_X'$ cannot be continuous. So a careful person needs a more general statement, such as the one
which is given below by Lemma I.5.


Lemma I.5 (Density from CDF derivative). Let $X$ be a random variable.

Suppose that $F_X$ is a continuous function.

Suppose also $F_X'(t)$ exists and is continuous at all points of $\mathbb{R}$, except
possibly at finitely many points $s_1, \dots, s_n$.

Then $F_X'$ is a density for the distribution of $X$.

Proof. Since $F_X$ is monotonic increasing, at any point $a$ where $F_X'(a)$ exists it
must be true that $F_X'(a) \ge 0$ (since the derivative is the limit of the difference
quotients).

Consider an interval $[u, v]$ which does not contain any bad points. The
second statement of the Fundamental Theorem of Calculus tells us that

\[ \int_u^v F_X'(t)\, dt = F_X(v) - F_X(u). \]

Note that $0 \le F_X(u) \le F_X(v) \le 1$.

For an interval $[u, v]$ which does not contain any bad points, if $v$ increases
to a bad point $s_j$, then $\int_u^v F_X'(t)\, dt$ increases to a limit. The calculus definition
of an improper integral says that

\[ \lim_{v \to s_j^-} \int_u^v F_X'(t)\, dt = \int_u^{s_j} F_X'(t)\, dt, \]
and by the continuity of $F_X$ we also have

\[ \lim_{v \to s_j^-} \bigl(F_X(v) - F_X(u)\bigr) = F_X(s_j) - F_X(u). \]

Thus
\[ \int_u^{s_j} F_X'(t)\, dt = F_X(s_j) - F_X(u). \]

If there is a bad point $s_{j-1}$ adjacent to $s_j$ on the left, we can let $u$ decrease
to $s_{j-1}$. A similar argument to the one just given shows that

\[ \int_{s_{j-1}}^{s_j} F_X'(t)\, dt = F_X(s_j) - F_X(s_{j-1}). \]

Based on these arguments, we can now say that if $[u, v]$ is any interval such
that the interior contains no bad points, we have

\[ \int_u^v F_X' = F_X(v) - F_X(u). \]

Next, think about an interval $[u, v]$ such that $(u, v)$ contains exactly one bad
point $s_k$. By what has already been said, we know that

\[ \int_u^{s_k} F_X' = F_X(s_k) - F_X(u) \]
and
\[ \int_{s_k}^{v} F_X' = F_X(v) - F_X(s_k). \]

Adding these two equations, we see that for any interval $[u, v]$ whose interior
contains at most one bad point,

\[ \int_u^v F_X' = F_X(v) - F_X(u). \]

Repeating this argument a finite number of times shows that for any interval
$[u, v]$,

\[ \int_u^v F_X' = F_X(v) - F_X(u). \]

Thus, by the definition of a density in terms of intervals, $F_X'$ is a density for the distribution of $X$.

This completes the proof. But readers may recall that we extended the
definition of a density later, in Definition 15.5. In that definition, a density
$f$ for the distribution of $X$ is required to satisfy

\[ P(X \in A) = \int_A f \tag{I.4} \]

for every event $A$, not just for intervals $A$. Do we need to check this?

Fortunately, that requirement is automatically satisfied if equation (I.4)
holds for all intervals $A$ (see Remark 15.6). Thus no further work is required,
and we conclude that $F_X'$ is a density for the distribution of $X$, in the general
sense of Definition 15.5.


Here’s a typical application of Lemma I.5.


Exercise I.6. Let X be the random variable in Exercise 
Use the CDF of X to find a probability density for the distribution of X. 


Exercise I.7. In the setting of Exercise E.3, find a probability density for
the distribution of $X$.
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Exercise I.8 (Checking that the distribution determines the ex- 
pected value). In Exercise [E.3| you found E[X}. In Exercise[I.7] you found 
the distribution of X, so you can find EX] by a different calculation. Check 
that you obtain the same answer. 


I.3 Change of variable


Let $X$ be a random variable whose distribution has a density $f$ on the real
line.

Let $\varphi$ be a continuous and strictly increasing function on an interval $J$ of
the real line, and suppose that $J$ contains the range of $X$.

Does the distribution of $\varphi(X)$ necessarily have a density? And if a density
exists, how do we find it?

The connection between the density and the distribution is of course based
on integrating the density. For that reason, to give general answers to these
questions we might want to use the theory of integration which is developed
in advanced analysis courses. But we can already get useful information from
calculus.

Assume that $f$ is continuous, except possibly at a finite number of bad
points. Then, at all non-bad points, Lemma I.4 tells us that $F_X'(a)$ exists
and

\[ F_X'(a) = f(a). \]

Suppose we know that $\varphi'$ exists and is continuous and nonzero at every
point of $J$, except possibly at a finite number of bad points. We’ll give a few
examples to illustrate an approach based on Lemma I.5 and the chain rule.


(i) Suppose that $\varphi(x) = e^{x}$. Then

\[ F_{\varphi(X)}(a) = F_{e^X}(a) = P(e^{X} \le a). \]
The exponential function is always positive, so for $a \le 0$, $F_{e^X}(a) = 0$.
The exponential function is an increasing one-to-one function.
Suppose that $a > 0$.

For any real number $x$, if $x \le \log a$ then $e^{x} \le e^{\log a} = a$.

The logarithm function is an increasing function on its domain. (Of
course, it has to be an increasing function, since it is the inverse of an increasing function, but we can take its derivative to check.)

For any real number $x$, if $e^{x} \le a$ then $\log e^{x} \le \log a$, i.e. $x \le \log a$.

We have shown that for $a > 0$ we have

\[ \{e^{X} \le a\} = \{X \le \log a\}. \]

Hence
\[ F_{e^X}(a) = P(X \le \log a) = F_X(\log a). \]

Thus for $a > 0$, if $F_X'(\log a)$ exists, by the chain rule we have $F_{e^X}'(a) = F_X'(\log a)\,\frac{1}{a}$, so that

\[ F_{e^X}'(a) = \begin{cases} 0 & \text{if } a < 0, \\ f(\log a)\,\dfrac{1}{a} & \text{if } a > 0. \end{cases} \]
We note 0 may be a bad point for $F_{e^X}'$. There are at most finitely many other
bad points. Thus Lemma I.5 says that $F_{e^X}'$ is a density for the distribution
of $e^{X}$, and we have the formula for this density.
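
Here is a small numerical illustration of the formula $f(\log a)/a$ from example (i). It assumes, purely for the sake of the example, that $X$ is uniform on $[0, 1]$, so that $f = 1$ on $[0, 1]$ and the density of $e^{X}$ is $1/a$ on $[1, e]$. The helper names and the sample size are likewise assumptions.

```python
import math
import random


def f(x):
    """Assumed density of X: uniform on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def density_exp_x(a):
    """Density of e^X given by the change-of-variable formula f(log a) / a."""
    return f(math.log(a)) / a if a > 0 else 0.0


def integrate(g, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    step = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * step) for i in range(n)) * step


# P(e^X <= 2) from the density formula; analytically this is log 2.
prob_formula = integrate(density_exp_x, 1.0, 2.0)

# The same probability estimated by simulating X and applying exp.
rng = random.Random(2)
n_samples = 100_000
prob_sim = sum(1 for _ in range(n_samples) if math.exp(rng.random()) <= 2.0) / n_samples

print(prob_formula, prob_sim, math.log(2.0))
```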


(ii) Let $\varphi(x) = x + x^3$. Then $\varphi$ is continuous. Since $\varphi'(x) = 1 + 3x^2$, we
see that $\varphi'$ exists and is positive and continuous everywhere. Thus $\varphi$ is strictly
increasing, and in particular $\varphi$ is one-to-one.

Also $\lim_{x \to \infty} \varphi(x) = \infty$ and $\lim_{x \to -\infty} \varphi(x) = -\infty$. Thus the range of $\varphi$
is the whole real line.

Let $\theta$ denote the inverse function $\varphi^{-1}$.

Since $\varphi$ is increasing, $x \le \theta(a)$ implies $\varphi(x) \le \varphi(\theta(a)) = a$. Since $\theta$ is
increasing, $\varphi(x) \le a$ implies $\theta(\varphi(x)) \le \theta(a)$, i.e. $x \le \theta(a)$.

We have shown that for every real number $a$ we have

\[ \{\varphi(X) \le a\} = \{X \le \theta(a)\}. \]

Hence
\[ F_{\varphi(X)}(a) = P(X \le \theta(a)) = F_X(\theta(a)). \tag{I.5} \]

The algebraic expression for $\theta$ does not seem neat, but since $\varphi'$ is never
zero, a calculus theorem says that $\theta'$ exists at every point. Also, since $\varphi \circ \theta(y) = y$, the chain rule says that

\[ (\varphi' \circ \theta)\, \theta' = 1. \]

That is,

\[ \theta'(a) = \frac{1}{\varphi'(\theta(a))} = \frac{1}{1 + 3\theta(a)^2}. \]

So we can find $\theta'$ if we need it.
Using equation (I.5), if $\theta(a)$ is not a bad point for $F_X'$, then by the chain
rule $F_{\varphi(X)}'(a)$ exists, and
\[ F_{\varphi(X)}'(a) = F_X'(\theta(a))\, \theta'(a) = f(\theta(a))\, \theta'(a). \]
Thus Lemma I.5 says $f(\theta(a))\,\theta'(a)$ is a density for the distribution of $\varphi(X)$.


(iii) The map $x \mapsto x^3$ is one-to-one and onto $\mathbb{R}$. It is an increasing function,
and so preserves order. Thus

\[ F_{X^3}(a) = P(X^3 \le a) = P(X \le a^{1/3}) = F_X(a^{1/3}). \]

For $a \ne 0$, if $a^{1/3}$ is not a bad point for $F_X'$ then $F_{X^3}'(a)$ exists and
\[ F_{X^3}'(a) = F_X'(a^{1/3})\, \frac{1}{3}\, a^{-2/3}. \]

Thus Lemma I.5 says $f(a^{1/3})\,\frac{1}{3}\,a^{-2/3}$ is a density for the distribution of $X^3$.


I.4 Converting a distribution to a uniform 


Suppose that we are interested in the distribution of $S_n = X_1 + \dots + X_n$
for an independent sequence of random variables $X_i$, where each $X_i$ has the
distribution described in an earlier example. Thus the distribution of each $X_i$
is given by the probability density $f$, where $f(x) = (1/3)x^2$ on the interval
$[-1, 2]$, and $f$ is zero on the rest of the real line.

We want to simulate $X_1, \dots, X_n$ on a computer, in order to check the
results we obtained in that example.

Simulating $X_1, \dots, X_n$ means running a computer program which produces a sequence of values $v_1, \dots, v_n$ that are statistically similar to a typical
sequence of values obtained from $X_1, \dots, X_n$.

As usual, we won't discuss how to write such a computer program, but
we will take note of the fact that it seems to be easier for the computer to
simulate a random sequence $Y_1, \dots, Y_n$, where each $Y_i$ has a uniform distribution. So to simplify the program we would like to express $X_1, \dots, X_n$ using
a uniform sequence $Y_1, \dots, Y_n$.
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This section shows how to do that. 
We'll start by finding the CDF of each X;, F'x,. Since F'x,(t) = P(X; < ¢), 
for t € [—1, 2] we compute: 


ru= [_tnden f 3eee-(3) 


Of course F'x,(t) = Oift < —land Fyx,(t) = lift > 2, but we will concentrate 
our attention on [—1, 2]. 

Let F' denote the restriction of Fy, to [—1, 2]. 

It is easy to check that F’ is continuous and strictly increasing, and maps 
[—1, 2] onto [0, 1] in a one-to-one fashion. 

Let y denote the inverse of F’. The map y is defined on (0, 1]. 

Finding a formula for y is not hard. Just solve u = (1/9)(t? +1) for t. 
We find that y(u) = (9u — 1), for u € [0, 1]. 


t 


1 
= —(t? +1). 
= (8 +1) 


-1 


Consider $[0,1]$ as a sample space with uniform distribution. Let $Y$ be a
random variable on $[0,1]$ defined by $Y(u) = u$. We are going to show that
$\varphi(Y)$ has the same distribution as $X_i$!

To do that, we will show that $F_{\varphi(Y)} = F_{X_i}$, and then apply a lemma
which says that random variables with the same CDF have identical distributions.

As a first step, notice that for any number $b \in [0,1]$,

$$P(Y \le b) = \mathrm{length}([0,b]) = b. \tag{I.6}$$

Let $a \in [-1,2]$. We claim that:

$$P(\varphi(Y) \le a) = P(Y \le F(a)). \tag{I.7}$$

Indeed, since $F$ is a strictly increasing function, $\varphi(Y) \le a$ holds if and only if
$F(\varphi(Y)) \le F(a)$, i.e. if and only if $Y \le F(a)$. Thus equation (I.7) holds.

But $Y$ has a uniform distribution on $[0,1]$, so
$P(Y \le F(a)) = \mathrm{length}([0,F(a)]) = F(a)$.

We conclude that for $a \in [-1,2]$,

$$P(\varphi(Y) \le a) = F(a).$$

That is, for $a \in [-1,2]$,

$$F_{\varphi(Y)}(a) = F_{X_i}(a).$$


We need to check this equation for other values of $a$.

Since $\varphi$ maps into $[-1,2]$, if $a > 2$ it is always true that $\varphi(Y) \le a$, and
for $a < -1$ it is never true that $\varphi(Y) \le a$.

Hence $F_{\varphi(Y)}(a) = 0$ for $a < -1$ and $F_{\varphi(Y)}(a) = 1$ for $a > 2$.

We've checked that $F_{\varphi(Y)}(a) = F_{X_i}(a)$ in all possible cases for $a$, so

$$F_{\varphi(Y)} = F_{X_i}.$$

By Lemma 18.23, $\varphi(Y)$ and $X_i$ have identical distributions.

It follows that for an independent sequence $Y_1,\dots,Y_n$, where each $Y_i$ has
the same distribution as $Y$, the sequence $\varphi(Y_1),\dots,\varphi(Y_n)$ will have the
same statistical properties as $X_1,\dots,X_n$.

And this tells us that to simulate $X_1,\dots,X_n$ on a computer, just simulate
$Y_1,\dots,Y_n$, and apply the function $\varphi$ to each value in the output.
Incidentally, this trick works for any random variable $X$ which is such
that we can find the inverse of $F_X$.
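
The recipe above can be tried out directly. The following is a minimal sketch, assuming Python's standard `random` module as the source of uniform values; the names `phi`, `simulate_x` and `empirical_cdf` are chosen here for illustration and do not come from the text. It applies $\varphi(u) = (9u-1)^{1/3}$ to uniform samples and compares the empirical CDF with $(t^3+1)/9$.

```python
import math
import random


def phi(u):
    """Inverse of F(t) = (t**3 + 1) / 9 on [0, 1], using the real cube root."""
    v = 9.0 * u - 1.0
    # abs(v) ** (1/3) with the sign restored gives the real cube root for v < 0.
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


def simulate_x(n, seed=0):
    """Return n simulated values of X, obtained by applying phi to uniform samples."""
    rng = random.Random(seed)
    return [phi(rng.random()) for _ in range(n)]


def empirical_cdf(values, t):
    """Fraction of the simulated values that are <= t."""
    return sum(1 for v in values if v <= t) / len(values)


samples = simulate_x(200_000)
for t in (-0.5, 0.0, 1.0, 1.5):
    exact = (t ** 3 + 1) / 9
    print(f"t={t:5.2f}  empirical={empirical_cdf(samples, t):.4f}  exact={exact:.4f}")
```

With a large sample the two columns should agree to about two decimal places; the agreement is statistical, not exact.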


I.5 Solutions for Chapter I


Solution (Exercise I.1). For $t < c$, clearly $\{\omega : X(\omega) \le t\}$ is empty, so
$P(X \le t) = 0$, i.e. $F_X(t) = 0$.
For $t \ge c$, clearly $\{\omega : X(\omega) \le t\} = \Omega$, so $P(X \le t) = 1$, i.e. $F_X(t) = 1$.
See Figure I.6.


Figure I.6: CDF for a constant random variable equal to c. 


Solution (Exercise I.2). Suppose that $a \le b$. Then $X \le a \implies X \le b$,
so $\{X \le a\} \subset \{X \le b\}$. Hence $P(X \le a) \le P(X \le b)$.



Solution (Exercise I.3). For any $t < c$, $\{X \le t\}$ is the empty set. Hence
$P(X \le t) = 0$, i.e. $F_X(t) = 0$.
For any $t \ge d$, $\{X \le t\} = \Omega$. Hence $P(X \le t) = 1$, i.e. $F_X(t) = 1$.


Solution (Exercise I.4). It is easy to check that

$$\int_0^{\pi/2} \sin t \, dt = 1,$$

so that $f$ really is a probability density. So the problem makes sense.

The range of $X$ is $[1, e^{\pi/2}]$.

For $t \in [1, e^{\pi/2}]$, $\{X \le t\} = \{x : 1 \le e^x \le t\} = [0, \log t]$. Thus for
$t \in [1, e^{\pi/2}]$,

$$F_X(t) = \int_0^{\log t} \sin u \, du = -\cos u \,\Big|_0^{\log t} = 1 - \cos(\log t).$$

By Exercise I.3, for every $t < 1$ we have $F_X(t) = 0$ and for every $t > e^{\pi/2}$
we have $F_X(t) = 1$.


Solution (Exercise I.5). It is helpful to refer to the accompanying figure while
solving this problem.

We notice that $X$ is decreasing on $[0,1]$ and increasing on $[1,3]$.

The range of $X$ is the interval $[0,4]$.

By Exercise I.3 we know that $F_X(t) = 0$ for $t < 0$ and $F_X(t) = 1$ for
$t > 4$.

For $u \in [0,1]$, the values of $X$ lie in $[0,1]$. For $u \in [1,2]$, the values of $X$
also lie in $[0,1]$. For $u \in [2,3]$, the values of $X$ lie in $[1,4]$.

The solutions of $(u-1)^2 = t$ are $u = 1 - \sqrt{t}$, $u = 1 + \sqrt{t}$.

For $0 \le t \le 1$,

$$\{u : X(u) \le t\} = \{u : (u-1)^2 \le t\} = \{u : 1 - \sqrt{t} \le u \le 1 + \sqrt{t}\}.$$

Thus for $0 \le t \le 1$,

$$P(X \le t) = \frac{1}{3}\,(2\sqrt{t}).$$

For $1 \le t \le 4$,

$$\{u : X(u) \le t\} = \{u : u \le 1 + \sqrt{t}\}.$$

Thus for $1 \le t \le 4$,

$$P(X \le t) = \frac{1}{3}\,(1 + \sqrt{t}).$$

The graph of $F_X$ is shown in Figure I.7.

Figure I.7: CDF for $X(t) = (1-t)^2$ on the sample space $[0,3]$.
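
The two branches of this CDF can be checked by simulation. The sketch below assumes, as the factor $1/3$ in the solution indicates, that the probability on the sample space $[0,3]$ is uniform; the function names are illustrative only.

```python
import math
import random


def cdf_exact(t):
    """CDF of X(u) = (u - 1)**2 when u is uniform on [0, 3]."""
    if t < 0:
        return 0.0
    if t <= 1:
        return (2.0 / 3.0) * math.sqrt(t)
    if t <= 4:
        return (1.0 + math.sqrt(t)) / 3.0
    return 1.0


def cdf_simulated(t, n=200_000, seed=1):
    """Estimate P(X <= t) by drawing n sample points u uniformly from [0, 3]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if (rng.uniform(0.0, 3.0) - 1.0) ** 2 <= t)
    return hits / n


for t in (0.25, 1.0, 2.25):
    print(f"t={t:4.2f}  simulated={cdf_simulated(t):.4f}  exact={cdf_exact(t):.4f}")
```

The value $t = 1$ is where the two formulas meet, and both give $2/3$ there.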
Solution (Exercise I.6). By Exercise I.5,

$$F_X(t) = \begin{cases}
0 & \text{if } t < 0,\\
\frac{2}{3}\sqrt{t} & \text{if } 0 \le t < 1,\\
\frac{1}{3}(1+\sqrt{t}) & \text{if } 1 \le t \le 4,\\
1 & \text{if } t > 4.
\end{cases}$$

Then

$$F_X'(t) = \begin{cases}
0 & \text{if } t < 0,\\
\frac{1}{3\sqrt{t}} & \text{if } 0 < t < 1,\\
\frac{1}{6\sqrt{t}} & \text{if } 1 < t < 4,\\
0 & \text{if } t > 4.
\end{cases}$$

Note that $F_X'(t)$ exists for all $t$ except $t = 0, 1, 4$, and $F_X'$ is continuous at
every point except $0, 1, 4$.
By Lemma I.5, $F_X'$ is a density for the distribution of $X$.
The graph of $F_X'$ is shown in Figure I.8.

Figure I.8: distribution density for $X(t) = (1-t)^2$ on the sample space $[0,3]$.


Solution (Exercise I.7). The function $X$ is increasing on $[0, \pi/4]$, and has
range $[0, 1/\sqrt{2}]$. Thus for $t \in [0, 1/\sqrt{2}]$,

$$F_X(t) = P(X \le t) = P(\{u : u \in [0, \pi/4],\ \sin u \le t\})
= P(\{u : 0 \le u \le \pi/4,\ u \le \arcsin t\}).$$

Hence

$$F_X(t) = P(\{u : 0 \le u \le \arcsin t\}) = P([0, \arcsin t]).$$

By assumption, $P$ is given by the density $\sqrt{2}\cos u$, so

$$F_X(t) = \int_0^{\arcsin t} \sqrt{2}\cos u \, du = \sqrt{2}\sin u \,\Big|_0^{\arcsin t} = \sqrt{2}\, t.$$

And of course, by good old Exercise I.3 we know that $F_X(t) = 0$ for $t < 0$
and $F_X(t) = 1$ for $t > 1/\sqrt{2}$.
Thus

$$F_X'(t) = \begin{cases}
0 & \text{if } t < 0,\\
\sqrt{2} & \text{if } 0 < t < 1/\sqrt{2},\\
0 & \text{if } t > 1/\sqrt{2}.
\end{cases}$$

By Lemma I.5 the distribution for $X$ has density $F_X'$.


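Solution (Exercise I.8). In Exercise I.7 we found that the distribution of
$X$ is uniform on $[0, 1/\sqrt{2}]$.
This distribution has a probability density $g$ given by

$$g(t) = \begin{cases}
\sqrt{2} & \text{if } 0 \le t \le 1/\sqrt{2},\\
0 & \text{otherwise.}
\end{cases}$$

By equation (15.5),

$$E[X] = \int_0^{1/\sqrt{2}} t\,g(t)\,dt = \int_0^{1/\sqrt{2}} t\sqrt{2}\,dt
= \frac{\sqrt{2}\,t^2}{2}\,\Big|_0^{1/\sqrt{2}} = \frac{1}{2\sqrt{2}}.$$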

This agrees with the result of Exercise 


Appendix J


Joint distributions and 
densities 


J.1 Random vectors and joint distributions 


Suppose that two physical random variables, X and Y, are associated with 
some experiment. In a probability model for this experiment there will be 
two corresponding mathematical random variables, which we will also call X 
and Y. 

Suppose that we know the probability distribution of X. If someone asks 
us a probability question about the behavior of X, we are ready to answer 
that question. Similarly, we can answer any probability question about Y if 
we know the probability distribution of Y. 

Now suppose that we need the answer to a more complicated question, 
involving the behavior of both X and Y. For example, suppose we need 
to find P(X < Y). To find that probability we are going to have to know
something about the relationship between X and Y. 

Thinking about two or more variables at once can be complicated. One 
sees that already when studying calculus. To deal with the complexity it is 
helpful to use some systematic terminology, as in the next definition. 


Definition J.1 (Cartesian products). Let C and D be any sets. The set 
of all pairs (x,y), where x € C and y € D, is called the Cartesian product 
of C and D, and is denoted by C x D. 

We can picture the Cartesian product of two intervals, [a,b] x [c,d], as 
the rectangle whose sides are [a,b] and [c,d]. See Figure J.1.


Figure J.1: (3, 1.5) is a point in [2,5] × [1,3].


Readers who are not familiar with Cartesian product terminology should 
note the next example. 


Example J.2. The statement
$$a \le x \le b \text{ and } c \le y \le d$$
is exactly equivalent to the statement
$$(x,y) \in [a,b] \times [c,d].$$

Thus
$$\{(x,y) : a \le x \le b \text{ and } c \le y \le d\} = [a,b] \times [c,d].$$

Incidentally, notice that from the definition of Cartesian product,
$\mathbb{R}^2 = \mathbb{R} \times \mathbb{R}$, which fits our notation for $\mathbb{R}^2$.

The general definition of a random variable (Definition 9.1) states that
the physical meaning of a random variable for an experiment is a quantity
whose value depends on the outcome of the experiment, and a mathematical
random variable is a map from a sample space to an appropriate set of values.
Up to this point we have concentrated on real-valued random variables, but
it is often convenient to use vector-valued random variables.


Definition J.3 (Random vectors taking values in $\mathbb{R}^2$). Suppose that
real-valued random variables $X,Y$ are defined on the same sample space. Let
$F$ be the map from the sample space to $\mathbb{R}^2$, defined by $F(\omega) = (X(\omega), Y(\omega))$
for each sample point $\omega$. Then $F$ is an $\mathbb{R}^2$-valued random variable.

We refer to $F$ as a random vector. We tend to denote a random vector
by one of the usual letters we employ for random variables. So we might say
we have the random vector $Z$ defined by $Z = (X,Y)$.

Actually one often refers to the random vector using sequence notation 
to list the vector, so that one just says (X,Y) rather than Z. 


Just as in the case of real-valued random variables, we write the set
$\{\omega : (X(\omega),Y(\omega)) \in S\}$ more briefly as $\{(X,Y) \in S\}$.

By Example J.2, for any random vector $(X,Y)$ taking values in $\mathbb{R}^2$, and
any intervals $[a,b]$ and $[c,d]$,

$$\{a \le X \le b\} \cap \{c \le Y \le d\} = \{(X,Y) \in [a,b] \times [c,d]\}. \tag{J.1}$$

More generally, for any subsets $A, B$ of the real line,

$$\{X \in A\} \cap \{Y \in B\} = \{(X,Y) \in A \times B\}. \tag{J.2}$$

Of course we can re-express equation (J.2) as:

$$\{X \in A \text{ and } Y \in B\} = \{(X,Y) \in A \times B\}. \tag{J.3}$$


Definition J.4 (Distribution of a random vector). Let $Z = (X,Y)$ be
a random vector. The probability distribution of $Z$ is the rule that specifies
$P(Z \in S)$, for every subset $S$ of $\mathbb{R}^2$.

The probability distribution of $Z$ is thus a probability set-function $Q$,
defined for subsets $S$ of $\mathbb{R}^2$ as

$$Q(S) = P(Z \in S). \tag{J.4}$$

This definition is essentially the same as the earlier definition of the
distribution of a real-valued random variable. A slightly different
terminology is often used:


Definition J.5 (Joint distribution terminology). For any real-valued
random variables $X$ and $Y$, defined mathematically on the sample space $\Omega$
of a probability model, the “joint probability distribution” of $X$ and $Y$ is
another name for the distribution of the random vector $(X,Y)$.


The use of the word “joint” for the distribution of the vectors is com- 
mon. It emphasizes the fact that one is dealing with two real-valued random 
variables at the same time. 


It can be shown that for a real-valued random variable $X$, the distribution
of $X$ is uniquely determined, once we know the value of $P(a \le X \le b)$ for all
intervals $[a,b]$. Similarly, it can be shown that the distribution of a random
vector $Z = (X,Y)$ is uniquely determined, once we know $P(Z \in R)$ for all
rectangles $R$. We state this fact next.


Lemma J.6 (Characterizing a distribution on $\mathbb{R}^2$). Let $Z,W$ be random
vectors taking values in $\mathbb{R}^2$, such that $P(Z \in R) = P(W \in R)$ for every
rectangle $R$.

Then $Z$ and $W$ have the same distribution.


The proof is not hard but requires technicalities, and is omitted. 


Exercise J.1. Let $X$ and $Y$ be random variables for some probability model.
Show:

$$P(X \in A) = P((X,Y) \in A \times \mathbb{R}) \quad\text{and}\quad P(Y \in B) = P((X,Y) \in \mathbb{R} \times B). \tag{J.5}$$


Equation (J.5) tells us that whenever we know the joint distribution of
$X$ and $Y$, we certainly know the distributions for $X$ and $Y$ separately.


J.2 Marginal distributions


Definition J.7 (Marginal distributions). The separate distributions for 
X and Y are referred to as the marginal distributions associated with the 
joint distribution of X,Y. 


The adjective “marginal” is presumably used because the word “margin”
can mean “edge”, and the values of the distributions of $X$ and $Y$ can be
conveniently collected at the edges of a two-dimensional table of probabilities
of the form $P((X,Y) = (x_i, y_j))$.

Suppose that the range of $X$ consists of the distinct values $x_1,\dots,x_k$, and
the range of $Y$ consists of the distinct values $y_1,\dots,y_\ell$. Then the range of
$(X,Y)$ must be included in the set of points $(x_i, y_j)$, although not every pair
$(x_i,y_j)$ need be an actual value of $(X,Y)$.

If we know the distribution of $(X,Y)$, then we know $P((X,Y) = (x,y))$
for every $(x,y) \in \mathbb{R}^2$. In particular we know $P((X,Y) = (x_i, y_j))$ for every
$i, j$.

Since $Y$ always has some value,

$$\{X = x_i\} = \bigcup_{j=1}^{\ell} \{X = x_i \text{ and } Y = y_j\}.$$

Thus

$$P(X = x_i) = \sum_{j=1}^{\ell} P(X = x_i \text{ and } Y = y_j). \tag{J.6}$$

Similarly

$$P(Y = y_j) = \sum_{i=1}^{k} P(X = x_i \text{ and } Y = y_j). \tag{J.7}$$


In the finite range case, equations (J.6) and (J.7) show that it is easy to
calculate the marginal distribution if you know the joint distribution.
For general random variables, Exercise J.1 tells us that

$$P(X \in S) = P((X,Y) \in S \times \mathbb{R}), \qquad P(Y \in T) = P((X,Y) \in \mathbb{R} \times T), \tag{J.8}$$

even though this does not necessarily give us a convenient formula.
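
In the finite range case the row and column sums of equations (J.6) and (J.7) are easy to compute mechanically. Here is a small sketch; the table `joint` below is a made-up example with two values of $X$ and three values of $Y$, not one taken from the text.

```python
from fractions import Fraction


def marginals(joint):
    """joint[i][j] = P(X = x_i and Y = y_j); returns the two marginal lists."""
    p_x = [sum(row) for row in joint]        # equation (J.6): sum over j
    p_y = [sum(col) for col in zip(*joint)]  # equation (J.7): sum over i
    return p_x, p_y


# Hypothetical joint table: rows correspond to values of X, columns to values of Y.
joint = [
    [Fraction(1, 8), Fraction(1, 8), Fraction(0)],
    [Fraction(1, 4), Fraction(1, 4), Fraction(1, 4)],
]

p_x, p_y = marginals(joint)
print("P(X = x_i):", p_x)
print("P(Y = y_j):", p_y)
```

Exact fractions are used so that the marginals can be compared without rounding error. Note that one entry of the table is zero, illustrating that not every pair $(x_i, y_j)$ need be an actual value of $(X,Y)$.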


Exercise J.2. Let $X$ and $Y$ be random variables such that $P(X = 1) =
P(Y = 1) = 1/2$ and $P(X = 2) = P(Y = 2) = 1/2$.
A possible joint distribution $p$ for $X, Y$ is given by four numbers: $p_{11}, p_{12}, p_{21}, p_{22}$,
where $p_{11} = P(X = 1, Y = 1)$, $p_{12} = P(X = 1, Y = 2)$, $p_{21} = P(X = 2, Y = 1)$,
$p_{22} = P(X = 2, Y = 2)$. These numbers must be such that $X$ and $Y$
have the correct marginal distributions.


(i) Find three different possible joint distributions for $X,Y$: $a$, $b$ and $c$.
Distribution $a$ should be such that $X$ and $Y$ are independent.
Display any distribution $p$ as follows:

$$\begin{matrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{matrix}$$


(ii) Let $p$ and $q$ be possible joint distributions for $X,Y$. Let $t$ be a number
in $[0,1]$. Prove that $tp + (1-t)q$ is also a possible joint distribution for $X,Y$.


J.3 Joint and marginal densities 


An earlier equation gives the general definition for the density of a
probability distribution, on any space.

The earlier definition for the real line says that a function $f$ is a probability
density for the distribution of a real-valued random variable $X$ if
$P(X \in S) = \int_S f$ for subsets $S$ of the real line.

Now we consider a random vector (X,Y). 


Definition J.8 (Density for a joint distribution). Let $X$ and $Y$ be real-
valued random variables for some probability model. Suppose that there
exists a probability density function $h$ on $\mathbb{R}^2$, such that

$$P((X,Y) \in S) = \int_S h \tag{J.9}$$

for subsets $S$ of $\mathbb{R}^2$.
This equation uses the modern notation for integration over a set, given in
Definition 3.5. We could also write equation (J.9) in calculus notation as

$$P((X,Y) \in S) = \iint_S h(x,y)\,dx\,dy. \tag{J.10}$$

When equation (J.9) holds for all $S$, we say that $h$ is a probability density
for the distribution of the random vector $(X,Y)$, and we write this briefly as
$(X,Y) \sim h$.

We also say that $h$ is a probability density for the joint distribution of $X$
and $Y$.


Just as in the case of the real line, if it happens that all the values of
$(X,Y)$ lie in some subset $T$ of $\mathbb{R}^2$, then the density $h$ can be assumed to be
zero at all points in the complement of $T$.

Lemma J.6 can be used to justify the next fact.


Lemma J.9 (Characterizing a distribution density on $\mathbb{R}^2$). Let $X,Y$
be real-valued random variables, and let $h$ be a probability density function
on $\mathbb{R}^2$, such that

$$P((X,Y) \in R) = \int_R h$$

for every rectangle $R$.
Then

$$P((X,Y) \in S) = \int_S h$$

for all sets $S$, so $h$ is a density for the probability distribution of $(X,Y)$.


Lemma J.10 (Marginal density formula from joint density). Let $X$
and $Y$ be real-valued random variables whose distribution has a joint
probability density $h$ defined on $\mathbb{R}^2$. Then the distribution of $X$ has a density $f$
on $\mathbb{R}$ and the distribution of $Y$ has a density $g$ on $\mathbb{R}$, given by

$$f(x) = \int_{-\infty}^{\infty} h(x,y)\,dy, \qquad
g(y) = \int_{-\infty}^{\infty} h(x,y)\,dx. \tag{J.11}$$

We can express equation (J.11) by saying that we obtain the probability
density for one coordinate of $(X,Y)$ by integrating out the variable
corresponding to the other coordinate.


Proof. Let $A$ be any interval of $\mathbb{R}$. By equation (J.5),

$$P(X \in A) = P((X,Y) \in A \times \mathbb{R}) = \iint_{A \times \mathbb{R}} h(x,y)\,dy\,dx
= \int_A \left( \int_{-\infty}^{\infty} h(x,y)\,dy \right) dx = \int_A f(x)\,dx,$$

where $f$ is defined as in equation (J.11).

Since

$$P(X \in A) = \int_A f(x)\,dx$$

for every interval $A$, $f$ is a density for the distribution of $X$.
The proof for $Y$ is similar.
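
To see "integrating out" in action, here is a small numerical sketch. The joint density $h(x,y) = x + y$ on the unit square is a hypothetical example chosen for this illustration (it is not the density from any exercise in the text); its marginal is $f(x) = x + 1/2$ on $[0,1]$.

```python
def h(x, y):
    """Hypothetical joint density: x + y on the unit square, zero elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return x + y
    return 0.0


def marginal_f(x, steps=2000):
    """Approximate f(x) = integral of h(x, y) dy, as in equation (J.11).

    h vanishes for y outside [0, 1], so the midpoint rule on [0, 1] suffices.
    """
    dy = 1.0 / steps
    return sum(h(x, (k + 0.5) * dy) for k in range(steps)) * dy


for x in (0.0, 0.3, 0.9):
    print(f"x={x:.1f}  numerical f(x)={marginal_f(x):.6f}  exact={x + 0.5:.6f}")
```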


Exercise J.3. In the setting of Exercise E.1, find the density $f$ for the
distribution of $X$.


J.4 Joint density for independent random vari- 
ables 


Suppose that real-valued random variables X,Y are independent, and we 
know the probability distribution of X and the probability distribution of Y. 
Can we find the joint distribution for X,Y? 

When $X$ and $Y$ have finite or countable range, the answer is easy:

$$P(X = x \text{ and } Y = y) = P(X = x)\,P(Y = y) \tag{J.12}$$

by independence. For a subset $S$ of $\mathbb{R}^2$, we can find $P((X,Y) \in S)$ by
adding up $P(X = x_i \text{ and } Y = y_j)$ for all $(x_i, y_j) \in S$.

The other easy case is the situation in which the distribution of each
random variable has its own density. That’s the subject of the present section.
We don’t assume ahead of time that there is a density for the joint
distribution of $X,Y$, but it turns out that there is one, and the formula is similar to
equation (J.12).


Lemma J.11 (Density when $X,Y$ are independent). Let $X,Y$ be real-
valued random variables which are independent.

Suppose that $f$ is a density for the distribution of $X$ and $g$ is a density
for the distribution of $Y$. Let $h(x,y) = f(x)g(y)$. Then $h$ is a density for the
distribution of $(X,Y)$.


Proof. Let $J_1$ and $J_2$ be intervals. Then

$$P((X,Y) \in J_1 \times J_2) = P(X \in J_1)\,P(Y \in J_2) = \left(\int_{J_1} f\right)\left(\int_{J_2} g\right)
= \iint_{J_1 \times J_2} f(x)g(y)\,dx\,dy = \int_{J_1 \times J_2} h.$$

We have shown that for every rectangle $R$,

$$P((X,Y) \in R) = \int_R h.$$

By Lemma J.9, $h$ is a density for the distribution of $(X,Y)$.


J.5 Convolutions: finding the density for the 
sum of two independent random variables 


Let X,Y be independent real-valued random variables whose distributions 
are given by densities f,g, respectively. Then a joint density h for (X,Y) is 
given by h(x, y) = f(x)g(y). 

Our goal in this section is to find a density for the distribution of X +Y. 

Let $A$ be an interval of the real line. Let $B = \{(x,y) : x + y \in A\}$. Then

$$P(X + Y \in A) = P((X,Y) \in B).$$

It is easy to check from the definitions that $1_B((x,y)) = 1_A(x+y)$. Hence

$$P((X,Y) \in B) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} 1_B((x,y))\,f(x)g(y)\,dy\,dx
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} 1_A(x+y)\,f(x)g(y)\,dy\,dx.$$

For fixed $x$, change the variable in the inner integral from $y$ to $t$, where $y = t - x$. Then
$x + y = t$, and $dy = dt$, and the double integral becomes

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} 1_A(t)\,f(x)\,g(t-x)\,dt\,dx
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} 1_A(t)\,f(x)\,g(t-x)\,dx\,dt
= \int_{-\infty}^{\infty} 1_A(t)\,w(t)\,dt,$$

where $1_A$ is the indicator function for $A$ (Definition 11.1) and

$$w(t) = \int_{-\infty}^{\infty} f(x)\,g(t-x)\,dx. \tag{J.13}$$

The function denoted by $w$ in equation (J.13) is known as the convolution
of $f$ and $g$, and is denoted by $f * g$.
We have shown that for any interval $A$ of the real line,

$$P(X + Y \in A) = \int_A (f * g)(t)\,dt.$$

Thus by definition, $f * g$ is a density for the distribution of $X + Y$.
Finding this density was our goal, so we are finished.
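
As an illustration of equation (J.13), the sketch below approximates the convolution numerically for two independent random variables that are uniform on $[0,1]$. This choice of $f$ and $g$ is an assumption made for the example; the resulting density of $X+Y$ is the well-known triangular density, equal to $t$ on $[0,1]$ and $2-t$ on $[1,2]$.

```python
def uniform_density(x):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def convolve(f, g, t, lo=-1.0, hi=2.0, steps=6000):
    """Midpoint-rule approximation of (f * g)(t) = integral of f(x) g(t - x) dx.

    The integral is taken over [lo, hi], which must contain the set where f is nonzero.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += f(x) * g(t - x)
    return total * dx


for t in (0.5, 1.0, 1.5):
    print(f"t={t:.1f}  (f*g)(t) is approximately {convolve(uniform_density, uniform_density, t):.3f}")
```

Because the integrand here is discontinuous, the approximation is only accurate to about the step size `dx`.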


J.6 Solutions for Appendix J


Solution (Exercise J.1). For a real-valued random variable $Y$, to say that
$Y \in \mathbb{R}$ places no restriction on the value of $Y$.

Thus when discussing real-valued random variables, the statements “$X \in
A$ and $Y \in \mathbb{R}$” and “$X \in A$” provide the same information.

In set language,

$$\{X \in A \text{ and } Y \in \mathbb{R}\} = \{X \in A\}.$$


Solution (Exercise J.2).


(i) The independent case always works, and that will be our choice for $a$:

$$a:\quad \begin{matrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{matrix}$$

If we move probability mass vertically in a representation like the one
for $a$, it has no effect on the marginal distribution of $Y$. The movement
does affect the marginal distribution of $X$, but, since the distribution of $X$ is
obtained by summing rows, it doesn’t matter in which column the movement
takes place.

So let’s obtain $b$ from $a$ by moving mass 1/8 upward in column one, and
compensating by moving mass 1/8 downward in column two:

$$b:\quad \begin{matrix} 3/8 & 1/8 \\ 1/8 & 3/8 \end{matrix}$$

We'll obtain a different distribution $c$ from $a$ by moving mass 1/8 downward
in column one, and compensating by moving mass 1/8 upward in column two:

$$c:\quad \begin{matrix} 1/8 & 3/8 \\ 3/8 & 1/8 \end{matrix}$$


(ii) Let $r = tp + (1-t)q$, so that

$$r_{ij} = t\,p_{ij} + (1-t)\,q_{ij}.$$

Then the marginal distribution for $X$ using $r$ is given by

$$P(X = i) = r_{i1} + r_{i2} = t\,p_{i1} + (1-t)\,q_{i1} + t\,p_{i2} + (1-t)\,q_{i2}
= t\,(p_{i1} + p_{i2}) + (1-t)\,(q_{i1} + q_{i2}) = t \cdot \tfrac{1}{2} + (1-t) \cdot \tfrac{1}{2} = \tfrac{1}{2}.$$

A similar computation works for the marginal distribution for $Y$ using $r$.
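
The computation in part (ii) can be checked on a concrete case. In this sketch the tables `a` and `b` are the independent distribution and the one obtained from it by moving mass 1/8 as described in part (i); the helper names `mix` and `has_correct_marginals` are illustrative only.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def mix(p, q, t):
    """Entrywise combination r = t*p + (1 - t)*q of two 2x2 joint tables."""
    return [[t * p[i][j] + (1 - t) * q[i][j] for j in range(2)] for i in range(2)]


def has_correct_marginals(r):
    """Check that r is nonnegative and that every row and column sums to 1/2."""
    rows_ok = all(r[i][0] + r[i][1] == HALF for i in range(2))
    cols_ok = all(r[0][j] + r[1][j] == HALF for j in range(2))
    nonneg = all(r[i][j] >= 0 for i in range(2) for j in range(2))
    return rows_ok and cols_ok and nonneg


a = [[Fraction(1, 4), Fraction(1, 4)], [Fraction(1, 4), Fraction(1, 4)]]
b = [[Fraction(3, 8), Fraction(1, 8)], [Fraction(1, 8), Fraction(3, 8)]]

r = mix(a, b, Fraction(1, 3))
print(r, has_correct_marginals(r))
```

A single value of $t$ is of course not a proof, but it is a quick way to catch an arithmetic slip.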


Solution (Exercise|J.3). We obtain the formula for f using equation (J.11). 
Thus 


= wees: je 1 2 
fe)= fo se2+a)dy= 22+ avin a 




Appendix K 


More about joint distributions 


K.1 Checking independence using joint distributions


A model often makes significant use of more than one random variable. It is 
important to be able to tell when the random variables are independent. 

The definition of independence for random variables $X, Y$ (Definition 12.2)
says that real-valued random variables $X,Y$ are independent if for any
subsets $S$ and $T$ of $\mathbb{R}$, $\{X \in S\}$ and $\{Y \in T\}$ are independent, i.e.

$$P(\{X \in S\} \cap \{Y \in T\}) = P(\{X \in S\})\,P(\{Y \in T\}).$$

In other words, $X,Y$ are independent if for all $S, T$,

$$P((X,Y) \in S \times T) = P(X \in S)\,P(Y \in T). \tag{K.1}$$


This equation makes it clear that independence for random variables is a 
property of the joint distribution of the random variables. 


Example K.1 (A non-independence example). In the setting of
Exercise E.1 we can prove that the random variables $X$ and $Y$ are not
independent.

It seems obvious physically that $X$ and $Y$ cannot be independent, since
information about the value of $X$ can give you information about the value
of $Y$. For example, knowing that $X$ is near one tells you that $Y$ is near zero,
and knowing the exact value of $X$ tells you that $Y$ takes one of at most two
possible values.
For an argument using the mathematical definition, let $A = (1/\sqrt{2}, 1) = B$.
Then $A \times B$ is a rectangle entirely outside the unit circle, since every
point $(x,y)$ of $A \times B$ has $x^2 + y^2 > 1$.

Since $P(X^2 + Y^2 \le 1) = 1$, $P((X,Y) \in A \times B) = 0$.

We know that $\{(X,Y) \in A \times B\} = \{X \in A \text{ and } Y \in B\}$, so
$P(X \in A \text{ and } Y \in B) = 0$. But both $X$ and $Y$ have densities that are
positive everywhere on $(-1,1)$, so by integrating these densities we know that
$P(X \in A) > 0$ and $P(Y \in B) > 0$.

Thus $P(X \in A \text{ and } Y \in B) = P(X \in A)\,P(Y \in B)$ is false. Hence
$\{X \in A\}$ and $\{Y \in B\}$ are not independent, and so $X$ and $Y$ are not
independent.


An earlier lemma tells us how to check efficiently for independence when the
ranges of $X$ and $Y$ are finite. Now we will derive a somewhat similar criterion
for independence when the distributions of $X$ and $Y$ are given by densities.

What we want to do now is turn Lemma J.11 around, and prove a converse
statement. Given a density $h(x,y)$ for the random vector $(X,Y)$, the idea is
that $X,Y$ will be independent if we can factor $h$ into a product of a function
of $x$ times a function of $y$.

If we can factor $h$ in this way, say $h(x,y) = f(x)g(y)$, then the factors
$f,g$ will give us the marginal densities for $X$ and $Y$. But a tiny bit of extra
work is necessary, because it may not be true that $\int f = 1$ and $\int g = 1$. For
example, we could always multiply $f$ by a million and divide $g$ by a million,
and obtain another perfectly correct factorization.

So what is actually true is that the factors $f(x)$ and $g(y)$ will be the
marginal densities, after we normalize them, i.e. after we multiply each
factor by an appropriate constant to ensure that its integral is equal to one.
The next lemma explains all this.


Lemma K.2 (Density criterion for independence). Let $X,Y$ be real-
valued random variables in some probability model. Suppose that $h$ is a
density for the distribution of $(X,Y)$, and let $f$ and $g$ be nonnegative
functions on $\mathbb{R}$, such that $h(x,y) = f(x)g(y)$ for all $x, y$.

Then $X$ and $Y$ are independent random variables. Furthermore, for some
constants $c_1$ and $c_2$, $c_1 f$ is a probability density for the distribution of $X$ and
$c_2 g$ is a probability density for the distribution of $Y$.
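

Proof. By Lemma J.10, a probability density for the distribution of $X$ is
given by

$$\int_{-\infty}^{\infty} f(x)g(y)\,dy = c_1 f(x), \tag{K.2}$$

where

$$c_1 = \int_{-\infty}^{\infty} g(y)\,dy. \tag{K.3}$$

Similarly a probability density for the distribution of $Y$ is given by

$$\int_{-\infty}^{\infty} f(x)g(y)\,dx = c_2\, g(y), \tag{K.4}$$

where

$$c_2 = \int_{-\infty}^{\infty} f(x)\,dx. \tag{K.5}$$

We note that

$$1 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x,y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x)g(y)\,dx\,dy
= \left(\int_{-\infty}^{\infty} f(x)\,dx\right)\left(\int_{-\infty}^{\infty} g(y)\,dy\right) = c_2\,c_1. \tag{K.6}$$

Let $J_1, J_2$ be intervals of the real line. Using the definitions and $c_1 c_2 = 1$,

$$P(\{X \in J_1\} \cap \{Y \in J_2\}) = P((X,Y) \in J_1 \times J_2) = \iint_{J_1 \times J_2} h(x,y)\,dx\,dy
= \iint_{J_1 \times J_2} f(x)g(y)\,dx\,dy = \iint_{J_1 \times J_2} c_1 f(x)\,c_2\, g(y)\,dx\,dy$$

$$= \left(\int_{J_1} c_1 f(x)\,dx\right)\left(\int_{J_2} c_2\, g(y)\,dy\right) = P(X \in J_1)\,P(Y \in J_2).$$

Thus the events $\{X \in J_1\}$ and $\{Y \in J_2\}$ are independent events. Since
this is true for all intervals $J_1, J_2$, an earlier lemma shows that $X, Y$ are independent
random variables.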
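
The normalization step in Lemma K.2 can be illustrated numerically. In the sketch below, the factorization $h(x,y) = f(x)g(y)$ with $f(x) = 1000\,e^{-2x}$ and $g(y) = 0.006\,e^{-3y}$ for $x, y \ge 0$ is a made-up example: neither factor integrates to one, but $c_1 f$ and $c_2 g$ do, with $c_1$ and $c_2$ as in equations (K.3) and (K.5).

```python
import math


def f(x):
    """Unnormalized factor in x (the constant 1000 is deliberately arbitrary)."""
    return 1000.0 * math.exp(-2.0 * x) if x >= 0.0 else 0.0


def g(y):
    """Unnormalized factor in y, chosen so that f(x) * g(y) is a joint density."""
    return 0.006 * math.exp(-3.0 * y) if y >= 0.0 else 0.0


def integrate(func, lo, hi, steps=40_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * dx) for k in range(steps)) * dx


UPPER = 40.0  # both factors are negligible beyond this point

c1 = integrate(g, 0.0, UPPER)  # equation (K.3): c1 is the integral of g
c2 = integrate(f, 0.0, UPPER)  # equation (K.5): c2 is the integral of f


def density_x(x):
    """Marginal density of X, namely c1 * f, as in equation (K.2)."""
    return c1 * f(x)


def density_y(y):
    """Marginal density of Y, namely c2 * g, as in equation (K.4)."""
    return c2 * g(y)


print("c1 * c2 =", c1 * c2)                 # close to 1, as in equation (K.6)
print("density_x(0) =", density_x(0.0))     # close to 2, the density 2*exp(-2x) at 0
```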


When we want to prove that independence does not hold, it may be
convenient to check densities at a single point.

Since independence is defined in terms of probabilities, not densities, we
will first have to make a connection between the value of a density at one
point and a probability. Continuity lets us do that.


Lemma K.3 (Density criterion for non-independence). Let $X,Y$ be
real-valued random variables for some probability model, such that the joint
distribution for $X,Y$ has a density $h$.
Let $f$ and $g$ be densities for the distributions of $X$ and $Y$ respectively.
Suppose for some $(a,b)$ that $f$ is continuous at $a$, that $g$ is continuous at
$b$, and that $h$ is continuous at $(a,b)$. Suppose also that $h(a,b) \ne f(a)g(b)$.
Then $X$ and $Y$ are not independent.


Proof. Assume that $X$ and $Y$ are independent. We will obtain a
contradiction.
By Lemma J.11, $f(x)g(y)$ is a density for the distribution of $(X,Y)$.
Thus $f(x)g(y)$ and $h(x,y)$ are both densities for the same distribution,
and they differ at a point $(a,b)$ where both of these densities are continuous.
That can’t happen! For example, suppose $f(a)g(b) > h(a,b)$. Let

$$\varepsilon = f(a)g(b) - h(a,b) > 0.$$

By continuity there is a disc $D$ around $(a,b)$ such that

$$f(x)g(y) > h(x,y) + \frac{\varepsilon}{2}$$

holds everywhere on $D$. But then

$$P((X,Y) \in D) = \iint_D f(x)g(y)\,dx\,dy > \iint_D h(x,y)\,dx\,dy + \frac{\varepsilon}{2}\,\mathrm{area}(D)
= P((X,Y) \in D) + \frac{\varepsilon}{2}\,\mathrm{area}(D).$$

That is a contradiction!


Note that Lemma K.3 gives us a convenient way to come to the conclusion
found in Example K.1.


Remark K.4 (Existence of independent random variables). In
mathematical arguments it can be useful to know that given probability densities
$f$ and $g$, there always exist independent mathematical random variables $X$
and $Y$ such that $f$ is a density for the distribution of $X$ and $g$ is a density
for the distribution of $Y$. This point seems physically obvious (just do two
separate experiments), so you may not want to worry about it. But a purely
mathematical argument is easy too.

To show this, let $\Omega$ be $\mathbb{R}^2$, and let $P$ be the probability distribution with
density $h(x,y) = f(x)g(y)$.

Let $X((x,y)) = x$ and let $Y((x,y)) = y$. Using definitions one can
show that $h$ is a probability density for the distribution of the random vector
$Z = (X,Y)$.

By Lemma K.2, $X,Y$ are independent, $f$ is a density for the distribution
of $X$ and $g$ is a density for the distribution of $Y$.


K.2 Conditional densities 


Let $X$ and $Y$ be random variables with a joint density $h$ on $\mathbb{R}^2$. Since a
probability density is a machine that produces probabilities when we
integrate, we can calculate conditional probabilities involving $X$ and $Y$ using the
standard definitions. For example, for any subsets $A$ and $B$ of $\mathbb{R}$,

$$P(X \in A \mid Y \in B) = \frac{P(\{X \in A\} \cap \{Y \in B\})}{P(Y \in B)}
= \frac{P((X,Y) \in A \times B)}{P(Y \in B)}
= \frac{\iint_{A \times B} h(x,y)\,dx\,dy}{\iint_{\mathbb{R} \times B} h(x,y)\,dx\,dy}. \tag{K.7}$$

See Figure K.1.
Using equation (K.7) to find conditional probabilities may require some
computational work, but it does not require new ideas. However, it can
at times be useful to assign a meaning to $P(X \in A \mid Y = b)$, for some
$b \in \mathbb{R}$, even when $P(Y = b) = 0$. This situation is obviously not covered by
the earlier definition of conditional probability, since it would involve division
by zero in equation (4.1). A correct definition is given below in Definition K.5.


Figure K.1: Integrate $h$ over $A \times B$ to get $P(X \in A \text{ and } Y \in B)$. Integrate
$h$ over the horizontal strip $\mathbb{R} \times B$ to get $P(Y \in B)$.


One can just accept that as a definition, and then see how such a
conditional probability is used, in Theorem K.6. But before giving the statement
of Definition K.5 it may be of interest to readers to motivate this definition
by considering a limit of conditional probabilities, as in equation (K.8).


Motivating Definition K.5


Physically, we can think that $P(X \in A \mid Y = b)$ expresses the probability
that $X \in A$ when given that $Y$ has a value that is “close” to $b$, so that
$P(X \in A \mid Y = b)$ actually means $P(X \in A \mid Y \approx b)$, at least in situations where
$P(Y \approx b) > 0$. But the event $\{Y \approx b\}$ is not precisely defined. Does it mean
$\{b - .0001 < Y < b + .0001\}$, or does it mean $\{b - .0000001 < Y < b + .0000001\}$?

We have to ask this, because $P(b - .0001 < Y < b + .0001)$ may be many
times greater than $P(b - .0000001 < Y < b + .0000001)$. We have to check
mathematically that there is not a problem.

Let’s look at the case that the distribution of $(X,Y)$ has a continuous joint
density $h(x,y)$. We wish to define a suitable value for $P(X \in A \mid Y = b)$. In
this situation let $B$ be a small subinterval of $\mathbb{R}$, say with length $2\delta$, such that
$b \in B$. See Figure K.2.


Figure K.2: Integrate $h$ over $A \times B$ to get $P(X \in A \text{ and } Y \in B)$. Integrate
$h$ over the horizontal strip $\mathbb{R} \times B$ to get $P(Y \in B)$.


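We think of $P(X \in A \mid Y = b)$ as being such that

$$P(X \in A \mid Y = b) \approx P(X \in A \mid Y \in B). \tag{K.8}$$

To make sense of equation (K.8), this approximation should be valid for any
small interval $B$ around $b$!


We have chosen $B = [b - \delta, b + \delta]$. Then

$$P(X \in A \mid Y \in B) = \frac{P(X \in A \text{ and } Y \in B)}{P(Y \in B)}
= \frac{\int_A \int_{b-\delta}^{b+\delta} h(x,t)\,dt\,dx}{\int_{-\infty}^{\infty} \int_{b-\delta}^{b+\delta} h(x,t)\,dt\,dx}. \tag{K.9}$$

Suppose $A$ is bounded. The continuity of $h$ implies that if $\delta$ is small enough,
for all $x \in A$ we have

$$h(x,t) \approx h(x,b) \quad \text{for all } t \in [b - \delta, b + \delta].$$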
Applying this approximation to equation (K.9) gives

$$P(X \in A \mid Y \in B) \approx \frac{\int_A h(x,b)\,2\delta\,dx}{\int_{-\infty}^{\infty} h(x,b)\,2\delta\,dx}
= \frac{\int_A h(x,b)\,dx}{\int_{-\infty}^{\infty} h(x,b)\,dx}.$$

Equation (J.11) tells us that $\int_{-\infty}^{\infty} h(x,b)\,dx = g(b)$, where $g$ is the density
of $Y$. So

$$P(X \in A \mid Y \in B) \approx \int_A \frac{h(x,b)}{g(b)}\,dx. \tag{K.10}$$

Notice that because the length $2\delta$ has been cancelled out in equation (K.10),
the approximate value we obtained for $P(X \in A \mid Y \in B)$ does not depend
on the choice of $B$. It will be a good approximation if $h$ is continuous and $B$
is a small interval containing $b$.
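
The approximation (K.10) can be watched at work numerically. The sketch below uses a hypothetical joint density, $h(x,y) = \tfrac{3}{2}(x^2 + y^2)$ on the unit square, chosen only for illustration, and compares $P(X \in A \mid Y \in [b-\delta, b+\delta])$ from equation (K.9) with the right side of (K.10) as $\delta$ shrinks.

```python
def h(x, y):
    """Hypothetical joint density: 1.5 * (x^2 + y^2) on the unit square, zero elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return 1.5 * (x * x + y * y)
    return 0.0


def integrate(func, lo, hi, steps=200):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * dx) for k in range(steps)) * dx


def cond_prob_given_interval(a_lo, a_hi, b, delta):
    """P(X in A | Y in [b - delta, b + delta]) computed as in equation (K.9)."""
    def strip(x):
        return integrate(lambda t: h(x, t), b - delta, b + delta)
    numerator = integrate(strip, a_lo, a_hi)
    denominator = integrate(strip, 0.0, 1.0)  # h vanishes for x outside [0, 1]
    return numerator / denominator


def cond_prob_given_value(a_lo, a_hi, b):
    """Integral over A of h(x, b) / g(b), the right side of equation (K.10)."""
    g_b = integrate(lambda x: h(x, b), 0.0, 1.0)
    return integrate(lambda x: h(x, b), a_lo, a_hi) / g_b


limit = cond_prob_given_value(0.0, 0.5, 0.4)
for delta in (0.1, 0.01, 0.001):
    approx = cond_prob_given_interval(0.0, 0.5, 0.4, delta)
    print(f"delta={delta:<6}  P(X in A | Y in B)={approx:.6f}  limit={limit:.6f}")
```

Here $A = [0, 0.5]$ and $b = 0.4$; the values for small $\delta$ settle down to the same number regardless of the exact width of $B$, which is the point of the discussion above.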

Based on equation (K.10), here is the precise definition of a conditional 
probability given the exact value of a random variable, when the random 
variable has a density. 


Definition K.5 (Conditional probability given exact value). Let $X,Y$
be random variables for some probability model. Let $h$ be a density for the
joint distribution of $X,Y$. Let $g$ be a density for the distribution of $Y$. For
any value $b \in \mathbb{R}$ such that $g(b) > 0$, and any subset $A$ of $\mathbb{R}$, a version of the
conditional probability that $X \in A$ given $Y = b$ is:

$$P(X \in A \mid Y = b) = \int_A \frac{h(x,b)}{g(b)}\,dx. \tag{K.11}$$

If $g(b) = 0$, for convenience we will define $P(X \in A \mid Y = b)$ to be zero.
Setting $P(X \in A \mid Y = b) = 0$ when $g(b) = 0$ has no physical meaning. We
are making that definition here simply to ensure that $P(X \in A \mid Y = b)$ is
always defined mathematically.

We speak of $P(X \in A \mid Y = b)$ as a version of the conditional probability
since it depends on the choice of the values for $h(x,b)$ and $g(b)$, and those
values (at a few points) can depend on the choice of the density $h$.


Remember that Definition K.5 is a new definition, not something we can
derive from our previous definitions. Equation (K.10) suggests that
$P(X \in A \mid Y = b)$ ought to be a useful concept, since it is often approximately equal
to the probability that $X \in A$ when $Y$ is close to $b$.
The approximate interpretation above was shown under a continuity
assumption. However, the following theorem uses $P(X \in A \mid Y = b)$ in
an expression that always has a physical meaning, without any assumption
about continuity.


Theorem K.6 (Total probability using exact cases). Let $X,Y$ be
random variables for some probability model. Let $h$ be a density for the joint
distribution of $X,Y$. Let $g$ be a density for the distribution of $Y$.

Let $A$ be a subset of $\mathbb{R}$, and let $P(X \in A \mid Y = y)$ be defined for all $y$ as
in Definition K.5, using $h$ and $g$. Then for any subset $B$ of $\mathbb{R}$,

$$P((X,Y) \in A \times B) = \int_B P(X \in A \mid Y = y)\,g(y)\,dy. \tag{K.12}$$


Proof.

$$\int_B P(X \in A \mid Y = y)\,g(y)\,dy = \int_B \left( \int_A \frac{h(x,y)}{g(y)}\,dx \right) g(y)\,dy
= \int_B \int_A h(x,y)\,dx\,dy = P((X,Y) \in A \times B).$$


In the proof of Theorem K.6, notice how the arbitrary value we gave to
$P(X \in A \mid Y = y)$ when $g(y) = 0$ only occurs in a place where it doesn’t
matter!

The statement of Theorem K.6 relates $P(X \in A \mid Y = y)$ to a probability
which is physically observable in principle, namely $P((X,Y) \in A \times B)$. We
might say that our definition of $P(X \in A \mid Y = y)$ is valid precisely because
it produces the correct value for $P((X,Y) \in A \times B)$.
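
Equation (K.12) can be checked numerically on a simple case. The joint density $h(x,y) = x + y$ on the unit square is a hypothetical example chosen for this sketch. The set $B = [0.5, 1.5]$ deliberately extends into the region where $g(y) = 0$, so that the convention $P(X \in A \mid Y = y) = 0$ there is exercised.

```python
def h(x, y):
    """Hypothetical joint density: x + y on the unit square, zero elsewhere."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return x + y
    return 0.0


def integrate(func, lo, hi, steps=200):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * dx) for k in range(steps)) * dx


def g(y):
    """Marginal density of Y, obtained by integrating out x."""
    return integrate(lambda x: h(x, y), 0.0, 1.0)


def cond_prob(a_lo, a_hi, y):
    """P(X in A | Y = y) as in Definition K.5, set to zero where g(y) = 0."""
    g_y = g(y)
    if g_y == 0.0:
        return 0.0
    return integrate(lambda x: h(x, y), a_lo, a_hi) / g_y


def total_probability(a_lo, a_hi, b_lo, b_hi):
    """Right side of equation (K.12)."""
    return integrate(lambda y: cond_prob(a_lo, a_hi, y) * g(y), b_lo, b_hi)


def direct_probability(a_lo, a_hi, b_lo, b_hi):
    """Left side of equation (K.12): integrate h over A x B."""
    return integrate(lambda y: integrate(lambda x: h(x, y), a_lo, a_hi), b_lo, b_hi)


lhs = direct_probability(0.0, 0.5, 0.5, 1.5)
rhs = total_probability(0.0, 0.5, 0.5, 1.5)
print(lhs, rhs)
```

For $A = [0, 0.5]$ and this $B$, a direct calculation by hand gives $P((X,Y) \in A \times B) = 1/4$.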


Remark K.7 (Total probability). Equation (4.16) and Theorem K.6 both
express the law of total probability, in different situations. Equation (K.12)
is the “continuous” version of equation (4.16). The role of the event C in
equation (4.16) is similar to the role of {X ∈ A} in equation (K.12), while
the role of D in equation (4.16) is similar to the role of {Y ∈ B}. The sum
over the events D_i is replaced by the integral over the “infinitesimal events”
{Y = y}.
Equations (K.10) and (K.10) suggest one more definition.


Definition K.8 (Conditional density). Let X,Y be random variables for
some probability model, and let h be a density for the joint distribution of
X,Y. Let g be a density for the distribution of Y. For any value y ∈ R with
g(y) > 0, let the conditional density f_X(x | Y = y) for X given Y = y be
defined by

$$f_X(x \mid Y = y) = \frac{h(x,y)}{g(y)}. \tag{K.13}$$

For convenience, if g(y) = 0 let f_X(x | Y = y) be defined to be zero.


We may write f_X(x | Y = y) more briefly as f(x | Y = y), when the
random variable X is known from the context.

We have thought of a density as something that you integrate to obtain a
probability. Our definition of the conditional density is consistent with this
view, since by equation (K.10),

$$P(X \in A \mid Y = y) = \int_A f(x \mid Y = y)\, dx. \tag{K.14}$$


Recall that any function h which produces the correct values for
P( (X,Y) ∈ S) is an allowable density for the joint distribution. Thus there
are many correct choices for h, and hence also for g. Is this a problem?

Not really. Notice that although our definition of f(x | Y = y) depends
on the choice of the densities h and g, Theorem K.6 shows that if we use
f(x | Y = y) to calculate an observable probability, the result will not depend
on the choice of h and g. So the non-uniqueness of h is not a problem, as
long as we stick to calculating observable probabilities.
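As a concrete illustration of Definition K.8, the following Python sketch builds a conditional density from an assumed joint density h(x, y) = x + y on the unit square. That density, the test values, and the function names are our own choices and are not taken from the text. The sketch checks that f(x | Y = y) integrates to 1 in x, and uses it to compute one conditional probability in the manner of equation (K.14).

```python
# Sketch: conditional density f(x | Y = y) = h(x, y) / g(y) for an assumed h.

def h(x, y):
    """Assumed joint density of (X, Y): x + y on the unit square, else 0."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def g(y):
    """Density of Y for this h: g(y) = y + 1/2 on [0, 1], else 0."""
    return y + 0.5 if 0.0 <= y <= 1.0 else 0.0

def cond_density(x, y):
    """f(x | Y = y); by convention this is 0 when g(y) = 0."""
    return h(x, y) / g(y) if g(y) > 0.0 else 0.0

def midpoint(func, lo, hi, steps=1000):
    """Approximate the integral of func over [lo, hi] by the midpoint rule."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (i + 0.5) * width) for i in range(steps))

y = 0.3
# A conditional density should integrate to 1 over the range of X.
total = midpoint(lambda x: cond_density(x, y), 0.0, 1.0)
# P(X in [0, 0.5] | Y = 0.3), computed as in equation (K.14).
prob = midpoint(lambda x: cond_density(x, y), 0.0, 0.5)

print(total, prob)
```

Working the second integral by hand gives (0.125 + 0.15)/0.8 = 0.34375, which the numerical value should match closely.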


K.3 Changing variables 


Sometimes the analysis of a problem becomes much simpler if we define new
variables, and re-express the problem in terms of the new variables. Of
course, when we do that, we have to convert expressions using the old
variables into equivalent expressions using the new variables. And the
conversion must be done correctly!

You may encounter probability calculations which involve a two-dimensional
change of variables. Just to give a sense of how that works, in this section
we will briefly consider a typical case.

Suppose that X,Y are real-valued random variables, such that the range
of the random vector (X,Y) is contained in some subset U of R². Let h be a
probability density for the distribution of the random vector (X,Y).

Let φ : U → R² be a map which is defined on U and has values in R².
Assume that φ is one-to-one. If (x,y) ∈ U and φ(x,y) = (u,v), we can think
of (u,v) as new coordinates for the point (x,y).

Let (U,V) = φ(X,Y). If φ is one-to-one, then we can write X,Y in terms
of U,V. So we can express everything involving X and Y in terms of U,V.
Doing that may require a probability density k for the distribution of (U,V).
How do we find k?

Assume that φ is one-to-one and onto, has derivatives, and so on. Make
it as nice as you like. We just want to get the idea of how to find k from h.

Let S be a subset of R². The density k must be such that

$$P(\varphi(X,Y) \in S) = \int_S k. \tag{K.15}$$

Let T be the set defined by

$$T = \{(x,y) : \varphi(x,y) \in S\}.$$

Then saying that φ(X,Y) ∈ S is the same as saying that (X,Y) ∈ T. That
is,

$$\{\varphi(X,Y) \in S\} = \{(X,Y) \in T\}.$$

Hence

$$P(\varphi(X,Y) \in S) = P((X,Y) \in T) = \int_T h. \tag{K.16}$$

Comparing equation (K.15) with equation (K.16), we see that we need k to
be such that

$$\int_S k = \int_T h.$$


This involves an old calculus topic: changing variables in an integral in the 
plane. 

It’s easier to think about functions than sets in this situation, so let’s use
indicators to write everything in terms of functions. We want to have

$$\int k\, 1_S = \int h\, 1_T. \tag{K.17}$$

Assume that the inverse map φ⁻¹ exists. Call it θ. The definition of T
says that φ(x,y) ∈ S is equivalent to (x,y) ∈ T, so θ(u,v) ∈ T is equivalent
to (u,v) ∈ S and so

$$1_S = 1_T \circ \theta.$$

Thus we want k to be such that

$$\int k\, (1_T \circ \theta) = \int h\, 1_T. \tag{K.18}$$


The calculus formula for changing variables in a two-dimensional integral
says that for any integrand f,

$$\int f = \int (f \circ \theta)\, |J|, \tag{K.19}$$

where J denotes the Jacobian determinant of the map θ.
If θ(u,v) = (θ_1(u,v), θ_2(u,v)), then the Jacobian determinant J is defined
by

$$J = \det \begin{pmatrix} \partial\theta_1/\partial u & \partial\theta_1/\partial v \\ \partial\theta_2/\partial u & \partial\theta_2/\partial v \end{pmatrix}. \tag{K.20}$$

In equation (K.19), the factor |J| plays the role that |θ′| would play in one
dimension.
Applying equation (K.19) with f = h 1_T gives us the following general
equation:

$$\int h\, 1_T = \int (h \circ \theta)\, (1_T \circ \theta)\, |J|. \tag{K.21}$$

After comparing equation (K.21) with equation (K.18), we see that equation
(K.18) will hold if

$$k = (h \circ \theta)\, |J|. \tag{K.22}$$


This is the change-of-variable formula that gives a density for the distribution
of φ(X,Y).

Sometimes there may be a few points where the change-of-coordinates
map φ is undefined, or the inverse is not differentiable. (We’re looking at you,
polar coordinates.) Usually we can just work with φ on the rest of its domain,
and still integrate the function in equation (K.22) to get probabilities.
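Here is a small numerical sketch of equation (K.22) for polar coordinates, written in Python. We assume (this example is ours, not from the text) that X and Y are independent standard normal random variables, so that h(x, y) = exp(−(x² + y²)/2)/(2π), and we take θ(r, t) = (r cos t, r sin t). The Jacobian determinant of equation (K.20) is approximated by finite differences, and the density k = (h ∘ θ)|J| is integrated to find the probability that the radius is at most 1. All function names are hypothetical.

```python
import math

def h(x, y):
    """Assumed joint density: two independent standard normal variables."""
    return math.exp(-(x * x + y * y) / 2.0) / (2.0 * math.pi)

def theta(r, t):
    """Inverse of the polar-coordinate map: (r, t) -> (x, y)."""
    return (r * math.cos(t), r * math.sin(t))

def jacobian_det(r, t, eps=1e-6):
    """Finite-difference approximation of the determinant in (K.20)."""
    d_r = [(p - m) / (2 * eps)
           for p, m in zip(theta(r + eps, t), theta(r - eps, t))]
    d_t = [(p - m) / (2 * eps)
           for p, m in zip(theta(r, t + eps), theta(r, t - eps))]
    # Rows are theta_1, theta_2; columns are derivatives in r and in t.
    return d_r[0] * d_t[1] - d_t[0] * d_r[1]

def k(r, t):
    """Density for (R, T) given by k = (h o theta) |J|, as in (K.22)."""
    x, y = theta(r, t)
    return h(x, y) * abs(jacobian_det(r, t))

def midpoint(func, lo, hi, steps):
    """Approximate the integral of func over [lo, hi] by the midpoint rule."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (i + 0.5) * width) for i in range(steps))

# P(R <= 1): integrate k over 0 < r <= 1 and 0 <= t < 2*pi.
prob_r_le_1 = midpoint(
    lambda r: midpoint(lambda t: k(r, t), 0.0, 2.0 * math.pi, 50),
    0.0, 1.0, 200)

print(prob_r_le_1, 1.0 - math.exp(-0.5))
```

For this example the Jacobian determinant is simply r, and the integral can be done by hand, giving 1 − exp(−1/2). The two printed numbers should agree to several decimal places.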




Appendix L 


Convolutions of functions on 
the integers 


The formula for the convolution of two functions on the real line was given in 
equation (J.13). The convolution operation for two functions on the integers 
has a formula which is similar, but simpler. We can gain some insights by 
exploring its properties. 


L.1 The general definition of convolutions of 
functions on the integers 


Let X be an integer-valued random variable. 

Then P(X = x) = 0 if x is not an integer. Let f be the function on the 
integers defined by f(n) = P(X =n). 

From the definition, f is simply the probability mass function for the 
distribution of X (Definition 9.7), with its domain restricted to the integers. 

As in equation (14.10), f has all the information contained in the distri- 
bution of X. 

We can picture a function f on the integers as a doubly infinite sequence 
of values: 


$$\ldots,\ f(-3),\ f(-2),\ f(-1),\ f(0),\ f(1),\ f(2),\ f(3),\ \ldots$$


For brevity, we will sometimes refer to any function on the integers as a 
sequence function. Then the function f defined by f(n) = P(X =n) will be 
called the sequence function for the distribution of X. 
One of the goals of this section is to find P(X + Y = n) in terms of f
and g, where g denotes the sequence function for the distribution of a
second integer-valued random variable Y.


Lemma L.1 (The sequence function for a sum of independent random
variables). Suppose that X and Y are independent integer-valued
random variables, whose distributions have sequence functions f and g,
respectively. Then:

$$P(X + Y = n) = \sum_{k=-\infty}^{\infty} P(X = k)\, P(Y = n - k) = \sum_{k=-\infty}^{\infty} f(k)\, g(n - k). \tag{L.1}$$


Proof. Notice that the events {X = k}, −∞ < k < ∞, cover all possibilities.
Hence

$$\{X + Y = n\} = \bigcup_{k=-\infty}^{\infty} \{X + Y = n \text{ and } X = k\}.$$

The events in this union are obviously disjoint, since X(ω) only has one value
for each ω. By countable additivity,

$$P(X + Y = n) = \sum_{k=-\infty}^{\infty} P(X + Y = n \text{ and } X = k).$$

Logically the statement X + Y = n and X = k is equivalent to the statement
X = k and Y = n − k. Hence

$$P(X + Y = n) = \sum_{k=-\infty}^{\infty} P(X = k \text{ and } Y = n - k).$$

Since X,Y are independent, P(X = k and Y = n − k) = P(X = k)P(Y = n − k),
and equation (L.1) follows.
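Equation (L.1) is easy to try out on a computer when the sequence functions are nonzero at only finitely many integers. The Python sketch below is our own illustration: it stores a sequence function as a dictionary from integers to probabilities, and uses the sum of two fair dice as the example. The name convolve is hypothetical.

```python
from fractions import Fraction

def convolve(f, g):
    """Return n -> sum over k of f(k) g(n - k), for finitely supported f, g.

    Sequence functions are stored as dicts; missing keys mean the value 0.
    """
    out = {}
    for k, f_k in f.items():
        for j, g_j in g.items():
            # The pair (k, j) contributes to n = k + j, i.e. j = n - k.
            out[k + j] = out.get(k + j, 0) + f_k * g_j
    return out

# Sequence function for one fair die: P(X = n) = 1/6 for n = 1, ..., 6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# By equation (L.1), this is the sequence function for the sum of two dice.
two_dice = convolve(die, die)

for n in sorted(two_dice):
    print(n, two_dice[n])
```

Using exact fractions rather than floating-point numbers lets us compare the output directly with hand calculations, such as P(X + Y = 7) = 6/36.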


We would like to understand the general properties of sums like the ones 
in equation (L.1). The next lemma tackles some analysis connected with 
that goal. 
Lemma L.2 (The convolution sum). Let α and β be functions defined
on the integers, such that

$$\sum_{n=-\infty}^{\infty} |\alpha(n)| \text{ converges and } \sum_{n=-\infty}^{\infty} |\beta(n)| \text{ converges.}$$

Then $\sum_{k=-\infty}^{\infty} |\alpha(k)\beta(n - k)|$ converges for each n.
Furthermore, the double series

$$\sum_{n=-\infty}^{\infty} \left( \sum_{k=-\infty}^{\infty} |\alpha(k)\beta(n - k)| \right) \tag{L.2}$$

is convergent.


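Proof. Since |α(k)β(n − k)| = |α(k)||β(n − k)|, we see that α(k) only
appears in this lemma as |α(k)|. Similarly β(k) only appears as |β(k)|. So
without loss of generality we can assume that α(k) ≥ 0 for all k and
β(n − k) ≥ 0 for all n, k.

Since $\sum_{k=-\infty}^{\infty} \beta(k)$ converges, β(k) → 0 as k → ±∞. Hence β(k) is
bounded, i.e. there is some constant c such that β(k) ≤ c for all k.

Hence α(k)β(n − k) ≤ cα(k) for all k. Since $\sum_{k} c\,\alpha(k)$ converges, we
know that $\sum_{k=-\infty}^{\infty} \alpha(k)\beta(n - k)$ converges also, by the comparison test.

We also know that a series of nonnegative terms can be rearranged freely
without altering its sum. This is also true for a doubly-indexed series. Hence

$$\sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} \alpha(k)\beta(j) = \left( \sum_{k=-\infty}^{\infty} \alpha(k) \right) \left( \sum_{j=-\infty}^{\infty} \beta(j) \right),$$

showing that this double series converges.

Also

$$\sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} \alpha(k)\beta(j) = \sum_{k=-\infty}^{\infty} \left( \sum_{j=-\infty}^{\infty} \alpha(k)\beta(j) \right).$$

Let j = n − k in the inner summation, for each k. This gives

$$\sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} \alpha(k)\beta(j) = \sum_{k=-\infty}^{\infty} \left( \sum_{n=-\infty}^{\infty} \alpha(k)\beta(n - k) \right) = \sum_{n=-\infty}^{\infty} \left( \sum_{k=-\infty}^{\infty} \alpha(k)\beta(n - k) \right).$$

This shows that the double series in equation (L.2) converges.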
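The key identity in this proof can be checked exactly in a small case. In the Python sketch below we assume that α and β are nonzero at only finitely many integers, so that every sum is finite. That restriction, the particular values, and the names are our own choices for illustration. The code compares the double sum of |α(k)β(n − k)| with the product of the sums of |α(n)| and |β(n)|.

```python
from fractions import Fraction

# Assumed finitely supported sequence functions (missing keys mean 0).
alpha = {-1: Fraction(1, 2), 0: Fraction(-1, 3), 2: Fraction(1, 4)}
beta = {0: Fraction(-2), 1: Fraction(1, 5)}

def value(seq, n):
    """Value of a finitely supported sequence function at the integer n."""
    return seq.get(n, Fraction(0))

def inner_sum(n):
    """Sum over k of |alpha(k) beta(n - k)| for a fixed n."""
    return sum(abs(a_k * value(beta, n - k)) for k, a_k in alpha.items())

# Only these n can give nonzero terms, since n = k + j with k, j in the supports.
n_values = range(min(alpha) + min(beta), max(alpha) + max(beta) + 1)

double_sum = sum(inner_sum(n) for n in n_values)
product_of_sums = (sum(abs(v) for v in alpha.values())
                   * sum(abs(v) for v in beta.values()))

print(double_sum, product_of_sums)
```

Here the sums of absolute values are 13/12 and 11/5, so both printed values should be 143/60.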


Definition L.3 (The convolution operation for sequence functions).
Let α and β be functions defined on the integers. These functions need not
be associated with probability distributions.

Suppose that α and β are such that

$$\sum_{n=-\infty}^{\infty} |\alpha(n)| \text{ converges and } \sum_{n=-\infty}^{\infty} |\beta(n)| \text{ converges.}$$

Define a function w on the integers by

$$w(n) = \sum_{k=-\infty}^{\infty} \alpha(k)\, \beta(n - k). \tag{L.3}$$

Then w is referred to as the convolution of the sequence functions α and β,
and is denoted by α * β.


We defined the * operation for general sequence functions. We can guess
some of the properties of the * operation by looking at sequence functions
for distributions.

With that goal in mind, let X, Y, Z be independent integer-valued random
variables, whose distributions have sequence functions α, β, γ respectively.
Thus α(n) = P(X = n), β(n) = P(Y = n), and γ(n) = P(Z = n).

By equation (L.1), α * β(n) = P(X + Y = n) and β * α(n) = P(Y + X = n).
This shows that

$$\alpha * \beta = \beta * \alpha. \tag{L.4}$$

Similarly, (α * β) * γ(n) = P((X + Y) + Z = n), and α * (β * γ)(n) =
P(X + (Y + Z) = n). This shows that

$$(\alpha * \beta) * \gamma = \alpha * (\beta * \gamma). \tag{L.5}$$


Equations (L.4) and (L.5) make us confident that statements (i) and (ii) 
of the following lemma hold. 


Lemma L.4 (Commutative, associative and distributive properties).

(i) The convolution operation on general sequence functions is commutative,
i.e. equation (L.4) holds for sequence functions α, β whenever
$\sum_{n=-\infty}^{\infty} |\alpha(n)|$ and $\sum_{n=-\infty}^{\infty} |\beta(n)|$ converge.

(ii) The convolution operation on general sequence functions is associative,
i.e. equation (L.5) holds for sequence functions α, β, γ whenever
$\sum_{n=-\infty}^{\infty} |\alpha(n)|$, $\sum_{n=-\infty}^{\infty} |\beta(n)|$, and $\sum_{n=-\infty}^{\infty} |\gamma(n)|$ converge.

(iii) The distributive law holds for convolution of general sequence functions:
for any sequence functions α, β, γ, whenever $\sum_{n=-\infty}^{\infty} |\alpha(n)|$,
$\sum_{n=-\infty}^{\infty} |\beta(n)|$, and $\sum_{n=-\infty}^{\infty} |\gamma(n)|$ converge,

$$\alpha * (\beta + \gamma) = \alpha * \beta + \alpha * \gamma.$$

In fact, the convolution operation is bilinear (Definition 16.24):

$$\alpha * (c_1 \beta + c_2 \gamma) = c_1\, \alpha * \beta + c_2\, \alpha * \gamma, \tag{L.6}$$

$$(c_1 \alpha + c_2 \beta) * \gamma = c_1\, \alpha * \gamma + c_2\, \beta * \gamma. \tag{L.7}$$

Of course since equation (L.6) holds for all α, β, γ, and convolution is
commutative, equation (L.7) is redundant here.


The proof is much like what we’ve already seen, and is omitted. 


Lemma L.4 can be summarized briefly by saying that we can manipulate
expressions involving convolution in much the same way that we manipulate 
expressions involving multiplication. 
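Since the proof is omitted, it may be reassuring to test these properties on an example. The Python sketch below is only a spot check under our own assumptions: the sequence functions have finite support and are stored as dictionaries, the particular values and constants are arbitrary, and the helper names convolve and combine are hypothetical. It checks equations (L.4), (L.5) and (L.6) exactly, using fractions.

```python
from fractions import Fraction

def convolve(a, b):
    """Convolution of finitely supported sequence functions stored as dicts."""
    out = {}
    for k, a_k in a.items():
        for j, b_j in b.items():
            out[k + j] = out.get(k + j, 0) + a_k * b_j
    # Drop zero values so that equal functions are equal as dicts.
    return {n: v for n, v in out.items() if v != 0}

def combine(c1, a, c2, b):
    """The linear combination c1*a + c2*b of two sequence functions."""
    out = {n: c1 * a.get(n, 0) + c2 * b.get(n, 0) for n in set(a) | set(b)}
    return {n: v for n, v in out.items() if v != 0}

alpha = {0: Fraction(1, 2), 1: Fraction(1, 2)}
beta = {-1: Fraction(1, 3), 2: Fraction(2, 3)}
gamma = {0: Fraction(1, 4), 3: Fraction(3, 4)}
c1, c2 = 2, -3

commutative = convolve(alpha, beta) == convolve(beta, alpha)
associative = (convolve(convolve(alpha, beta), gamma)
               == convolve(alpha, convolve(beta, gamma)))
bilinear = (convolve(alpha, combine(c1, beta, c2, gamma))
            == combine(c1, convolve(alpha, beta), c2, convolve(alpha, gamma)))

print(commutative, associative, bilinear)
```

A check like this does not prove the lemma, of course, but a failure would tell us that we had misunderstood one of the statements.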


L.2 The δ_a function on the integers


For any integer a, let δ_a denote the sequence function defined by

$$\delta_a(n) = \begin{cases} 1 & \text{if } n = a, \\ 0 & \text{otherwise.} \end{cases} \tag{L.8}$$

Readers should be aware that the same δ_a notation is used for other
mathematical objects, especially for the “Dirac delta function” located at
the point a. The sequence function δ_a defined here is not the same as the
Dirac delta function, but it has some similarities, so using the same notation
seems appropriate.

Notice that the sequence function δ_a is the sequence function for the
distribution of a constant random variable, namely the random variable which
is equal to a everywhere.


Exercise L.1 (Convolution with the sequence function δ_a). Prove that
for any sequence function f,

$$\delta_a * f(n) = f(n - a). \tag{L.9}$$

Equation (L.9) says that convolving f with δ_a shifts the values of f to
the right by a.
As applications of Exercise L.1 we see that

$$\delta_0 * f = f \tag{L.10}$$

for any sequence function f, and also

$$\delta_a * \delta_b = \delta_{a+b}. \tag{L.11}$$
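The shifting behavior is easy to see in a small computation. The following Python sketch is a numerical illustration rather than a proof, so the exercise is still worth doing by hand. It assumes finitely supported sequence functions stored as dictionaries; the example f and the names convolve and delta are our own.

```python
from fractions import Fraction

def convolve(a, b):
    """Convolution of finitely supported sequence functions stored as dicts."""
    out = {}
    for k, a_k in a.items():
        for j, b_j in b.items():
            out[k + j] = out.get(k + j, 0) + a_k * b_j
    return out

def delta(a):
    """The sequence function that is 1 at the integer a and 0 elsewhere."""
    return {a: 1}

# An arbitrary example sequence function (missing keys mean 0).
f = {-2: Fraction(1, 4), 0: Fraction(1, 2), 3: Fraction(1, 4)}
a = 5

shifted = convolve(delta(a), f)

# Compare (delta_a * f)(n) with f(n - a) over a window of integers.
shift_ok = all(shifted.get(n, 0) == f.get(n - a, 0) for n in range(-10, 11))
identity_ok = convolve(delta(0), f) == f
deltas_ok = convolve(delta(2), delta(3)) == delta(5)

print(shifted, shift_ok, identity_ok, deltas_ok)
```

The printed dictionary should have the same values as f, with each key moved 5 places to the right.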


Let’s try out these ideas on the binomial distribution with parameters
n, p.

Let X_1,...,X_n be independent random variables, with P(X_i = 1) = p
and P(X_i = 0) = 1 − p for all i. Let S_n = X_1 + ··· + X_n.

The distribution of S_n is known to be binomial with parameters n, p, but
suppose we are unaware of that, and wish to find the distribution of S_n.

One can start by noting that the sequence function f_i for the distribution
of X_i is very simple: f_i(1) = p, f_i(0) = 1 − p, and f_i(n) = 0 for all other n.

In other words,

$$f_i = p\,\delta_1 + (1 - p)\,\delta_0. \tag{L.12}$$

Using equation (L.1) and the associative property of convolution, we know
that the sequence function g for S_n is given by

$$g = f_1 * \cdots * f_n = (p\,\delta_1 + (1 - p)\,\delta_0)^{*n}, \tag{L.13}$$




where we use the notation (pδ_1 + (1 − p)δ_0)^{*n} to indicate the convolution of
n identical factors.

Because the algebra of convolution is so similar to the algebra of
multiplication, we can expand the convolution product in equation (L.13) using
the binomial theorem. This gives

$$g = \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k}\, \delta_1^{*k} * \delta_0^{*(n-k)}.$$

Using equation (L.11),

$$g = \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k}\, \delta_k. \tag{L.14}$$

Evaluating the right side of this equation at the point j shows at once that

$$g(j) = \begin{cases} \binom{n}{j} p^j (1 - p)^{n-j} & \text{for } j = 0, \ldots, n, \\ 0 & \text{otherwise.} \end{cases}$$

Thus g is the sequence function for the binomial distribution with parameters
n, p.
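We can confirm this conclusion by brute force in a small case. The Python sketch below forms the n-fold convolution of the sequence function pδ_1 + (1 − p)δ_0 and compares it with the binomial formula. The choices n = 5 and p = 1/3, and the helper names, are our own assumptions for illustration.

```python
from fractions import Fraction
from math import comb

def convolve(a, b):
    """Convolution of finitely supported sequence functions stored as dicts."""
    out = {}
    for k, a_k in a.items():
        for j, b_j in b.items():
            out[k + j] = out.get(k + j, 0) + a_k * b_j
    return out

def convolution_power(f, n):
    """Convolution of n copies of f, starting from delta_0 (see (L.10))."""
    result = {0: Fraction(1)}
    for _ in range(n):
        result = convolve(result, f)
    return result

n, p = 5, Fraction(1, 3)

# Sequence function p*delta_1 + (1 - p)*delta_0 for one X_i.
f_i = {1: p, 0: 1 - p}

# Sequence function g for S_n, computed as a convolution of n factors.
g = convolution_power(f_i, n)

# The binomial formula, for comparison.
binomial = {j: comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)}

print(g == binomial)
for j in sorted(g):
    print(j, g[j])
```

For example, with these values g(2) should equal 10 · (1/3)² · (2/3)³ = 80/243.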

We could also reverse this argument. Suppose we wish to study the 
binomial distribution without thinking about experiments. Now start with 
the assumption that g is the sequence function for the binomial distribution 
with parameters n, p. 

The formula for the binomial distribution tells us that equation (L.14)
holds. The binomial theorem then shows that equation (L.13) also
holds.


L.3 Solutions for Appendix L


Solution (Exercise L.1). By definition,

$$\delta_a * f(n) = \sum_{k=-\infty}^{\infty} \delta_a(k)\, f(n - k).$$

By the definition of δ_a, the only surviving term in the sum on the right is
the term with k = a. Since δ_a(a) = 1, the result is f(n − a).